Abstract

We investigate the problem of scanning and prediction ("scandiction", for short) of multi-dimensional
data arrays. This problem arises in several aspects of image and video processing,
such as predictive coding, for example, where an image is compressed by coding the error 
sequence resulting from scandicting it. Thus, it is natural to ask what is the optimal method 
to scan and predict a given image, what is the resulting minimum prediction loss, and whether 
there exist specific scandiction schemes which are universal in some sense. 

Specifically, we investigate the following problems: First, modeling the data array as a
random field, we wish to examine whether there exists a scandiction scheme which is independent
of the field's distribution, yet asymptotically achieves the same performance as if this
distribution were known. This question is answered in the affirmative for the set of all spatially
stationary random fields and under mild conditions on the loss function. We then discuss the 
scenario where a non-optimal scanning order is used, yet accompanied by an optimal predictor, 
and derive bounds on the excess loss compared to optimal scanning and prediction. 

This paper is the first part of a two-part paper on sequential decision making for multi-dimensional
data. It deals with clean, noiseless data arrays. The second part deals with noisy
data arrays, namely, with the case where the decision maker observes only a noisy version of
the data, yet it is judged with respect to the original, clean data.

*The material in this paper was presented in part at the IEEE International Symposium on Information Theory, Seattle, Washington, United States, July 2006, and the Electricity 2006 convention, Eilat, Israel, November 2006.
†Asaf Cohen and Neri Merhav are with the Department of Electrical Engineering, Technion - I.I.T., Haifa 32000, Israel. E-mails: {soofsoof@tx,merhav@ee}.technion.ac.il.
‡Tsachy Weissman is with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA. E-mail: tsachy@stanford.edu.

Index Terms: Universal scanning, scandiction, sequential decision making, multi-dimensional
data, random field, individual image.

1 Introduction 

Consider the problem of sequentially scanning and predicting a multidimensional data array,
while minimizing a given loss function. In particular, at each time instant t, $1 \le t \le |B|$, where
$|B|$ is the number of sites ("pixels") in the data array, the scandictor chooses a site to be
visited, denoted by $\Psi_t$, and gives a prediction, $F_t$, for the value at that site. Both $\Psi_t$ and $F_t$
may depend on the previously observed values, namely the values at sites $\Psi_1$ to $\Psi_{t-1}$. It then observes
the true value, $x_{\Psi_t}$, suffers a loss $l(x_{\Psi_t}, F_t)$, and so on. The goal is to minimize the cumulative
loss after scandicting the entire data array.

The problem of sequentially predicting the next outcome of a one-dimensional sequence (or 
any data array with a fixed, predefined order), $x_t$, based on the previously observed outcomes,
$x_1, x_2, \ldots, x_{t-1}$, is well studied. The problem of prediction in multidimensional data arrays (or
when reordering of the data is allowed), however, has received far less attention. Apart from 
the on-line strategies for the sequential prediction of the data, the fundamental problem of 
scanning it should be considered. We refer to the former problem as the prediction problem, 
where no reordering of the data is allowed, and to the latter as the scandiction problem. 

The scandiction problem mainly arises in image compression, where various methods of 
predictive coding are used (e.g., [1]). In this case, the encoder may be given the freedom 
to choose the actual path over which it traverses the image, and thus it is natural to ask 
which path is optimal in the sense of minimal cumulative prediction loss (which may result in 
maximal compression). The scanning problem also arises in other areas of image processing, 
such as one-dimensional wavelet [2] or median [3] processing of images, where one seeks a 
space-filling curve which facilitates the one-dimensional signal processing of multidimensional 
data, digital halftoning [4], where a space-filling curve is sought in order to minimize the
effect of false contours, and pattern recognition [5], where it is shown that under certain 
conditions, the Bayes risk as well as the optimal decision rule are unchanged if instead of 
the original multidimensional classification problem one transforms the data using a measure- 
preserving space-filling curve and solves a simpler one-dimensional problem. More applications 
can be found in multidimensional data query [6], [7] and indexing [8], where multidimensional
data is stored on a one-dimensional storage device, hence a locality-preserving space-filling
curve is sought in order to minimize the number of continuous read operations required to
access a multidimensional object, and rendering of three-dimensional graphics [9], [10], where
a rendering sequence which minimizes the number of cache misses is required.

The above applications have already been considered in the literature, and the benefits
of non-trivial scanning orders have been proved (see [11], or [12] and [13], which we discuss
later). Yet, the scandiction problem may have applications that go beyond image scanning
alone. For example, consider a robot processing various types of products in a warehouse. 
The robot identifies a product using a bar-code or an RFID, and processes it accordingly. 
If the robot could predict the next product to be processed, and prepare for that operation 
while commuting to the product (e.g., prepare an appropriate writing-head and a message to
be written), then the total processing time could be smaller compared to preparing for the
operation only after identifying the product. Since different sites in the warehouse may be
correlated in terms of the various products stored in them, it is natural to ask what is the 
optimal path to scan the entire warehouse in order to achieve minimum prediction error and 
thus minimal processing time. 

In [14], a specific scanning method was suggested by Lempel and Ziv for the lossless compression
of multidimensional data. It was shown that the application of the incremental parsing
algorithm of [15] on the one-dimensional sequence resulting from the Peano-Hilbert scan yields
a universal compression algorithm with respect to all finite-state scanning and encoding machines.
These results were later extended in [16] to the probabilistic setting, where it was
shown that this algorithm is also universal for any stationary Markov random field [17]. Using
the universal quantization algorithm of [18], the existence of a universal rate-distortion encoder
was also established. Additional results regarding lossy compression of random fields (via pattern
matching) were given in [19] and [20]. For example, in [20], Kontoyiannis considered a
lossy encoder which encodes the random field by searching for a D-closest match in a given 
database, and then describing the position in the database. 
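
As an illustration of the Peano-Hilbert scan mentioned above, the following Python sketch generates the visiting order of a Hilbert curve on a $2^k \times 2^k$ grid, using the standard iterative index-to-coordinate conversion. It is only a minimal sketch and is not taken from [14]; the function names `hilbert_d2xy` and `hilbert_scan` are hypothetical.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to a site (x, y) on a 2**order x 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the quadrant so that consecutive sub-curves connect.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_scan(order):
    """Return the list of sites in the order a Hilbert scan visits them."""
    n = 1 << order
    return [hilbert_d2xy(order, d) for d in range(n * n)]


scan = hilbert_scan(3)  # visiting order for an 8 x 8 array
```

Every site is visited exactly once and consecutive sites are always neighbors, which is the locality property that makes such space-filling curves attractive for one-dimensional processing of images.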

While the algorithm suggested in [14] is asymptotically optimal, it may not be the optimal
compression algorithm for real-life images of sizes such as 256 × 256 or 512 × 512. In [12],
Memon et al. considered image compression with a codebook of block scans. Therein, the
authors sought a scan which minimizes the zero-order entropy of the difference image, namely,
that of the sequence of differences between each pixel and its preceding pixel along the scan.
Since this problem is computationally expensive, the authors aimed for a suboptimal scan
which minimizes the sum of absolute differences. This scan can be seen as a minimum spanning
tree of a graph whose vertices are the pixels in the image and whose edge weights represent
the differences (in gray levels) between each pixel and its adjacent neighbors. Although the 
optimal spanning tree can be computed in linear time, encoding it may yield a total bit rate 
which is higher than that achieved with an ordinary raster scan. Thus, the authors suggested 
to use a codebook of scans, and encode each block in the image using the best scan in the 
codebook, in the sense of minimizing the total loss. 

Lossless compression of images was also discussed by Dafner et al. in [13]. In this work, a
context-based scan which minimizes the number of edge crossings in the image was presented.
Similarly to [12], a graph was defined and the optimal scan was represented through a minimal
spanning tree. Due to the bit rate required to encode the scan itself the results fall short 
behind [H] for two-dimensional data, yet they are favorable when compared to applying the 
algorithm in [T3] to each frame in a three-dimensional data (assuming the context-based scans 
for each frame in the algorithm of [1^ are similar). 

Note that although the criterion chosen by Memon et al. in [12], or by Dafner et al. in [13],
which is to minimize the sum of cumulative (first-order) prediction errors or edge crossings,
is similar to the criterion defined in this work, there are two important differences. First, the 
weights of the edges of the graph should be computed before the computation of the optimal 
(or suboptimal) scan begins, namely, the algorithm is not sequential in the sense of scanning 
and prediction in one pass. Second, the weights of the edges can only represent prediction 
errors of first order predictors (i.e., context of length one), since the prediction error for longer 
context depends on the scan itself - which has not been computed yet. In the context of lossless 
image coding it is also important to mention the work of Memon et al. in [21], where common
scanning techniques (such as raster scan, Peano-Hilbert and random scan) were compared in
terms of minimal cumulative conditional entropy given a finite context (note that for unlimited
context the cumulative conditional entropy does not depend on the scanning order, as will be
elaborated on later). The image model was assumed to be an isotropic Gaussian random
field. Surprisingly, the results of [21] show that context-based compression techniques based
on limited context may not gain by using Hilbert scan over raster scan. Note that under a 
different criterion, cumulative squared prediction error, the raster scan is indeed optimal for 
Gaussian fields, as it was shown later in [22], which we discuss next. 

The results of [14] and [16] considered a specific, data-independent scan of the data set.
Furthermore, even in the works of Memon et al. [12] or Dafner et al. [13], where data-dependent
scanning was considered, only limited prediction methods (mainly, first-order predictors) were
discussed, and the criterion used was minimal total bit rate of the encoded image. However,
for a general predictor, loss function and random field (or individual image), it is not clear
what the optimal scan is. This more general scenario was discussed in [22], where Merhav
and Weissman formally defined the notion of a scandictor, a scheme for both scanning and
prediction, as well as that of scandictability, the best expected performance on a data array.
The main result in [22] is the fact that if a stochastic field can be represented autoregressively
(under a specific scan $\Psi$) with a maximum-entropy innovation process, then it is optimally
scandicted in the way it was created (i.e., by the specific scan $\Psi$ and its corresponding optimal
predictor).

While defining the yardstick for analyzing the scandiction problem, the work in [22] leaves
many open challenges. As the topic of prediction is rich and includes elegant solutions to 
various problems, seeking analogous results in the scandiction scenario offers plentiful research 
objectives. 

In Section 3, we consider the case where one strives to compete with a finite set of
scandictors. Specifically, assume that there exists a probability measure Q which governs the
data array. Of course, given the probability measure Q and the scandictor set, one can compute
the optimal scandictor in the set (in some sense which will be defined later). However, we are
interested in a universal scandictor, which scans the data independently of Q, and yet achieves
essentially the same performance as the best scandictor in the set (see [23] for a complete survey of
universal prediction). The reasoning behind the actual choice of the scandictor set $\mathcal{F}$ is similar
to that common in the filtering and prediction literature (e.g., [24] and [25]). On the one hand,
it should be large enough to cover a wide variety of random fields, in the sense that for each 
random field in the set, at least one scandictor is sufficiently close to the optimal scandictor 
corresponding to that random field. On the other hand, it should be small enough to compete 
with, at an acceptable cost of redundancy. 

At first sight, in order to compete successfully with a finite set of scandictors, i.e., construct 
a universal scandictor, one may try to use known algorithms for learning with expert advice, 
e.g., the exponential weighting algorithm suggested in [26] or the work which followed it. In this 
algorithm, each expert is assigned a weight according to its past performance. By decreasing 
the weight of poorly performing experts, hence preferring the ones proved to perform well 
thus far, one is able to compete with the best expert, without any a priori knowledge
of the input sequence or of which expert will perform best. However, in the scandiction
problem, as each of the experts may use a different scanning strategy, at a given point in time 
each scanner might be at a different site, with different sites as its past. Thus, it is not at all 
guaranteed that one can alternate from one expert to the other. The problem is even more 
involved when the data is an individual image, as no statistical properties of the data can be 
used to facilitate the design or analysis of an algorithm. In fact, the first result in Section
3 is a negative one, stating that indeed in the individual image scenario (or under expected
minimum loss in the stochastic scenario), it is not possible to successfully compete with any two
scandictors on any individual image. This negative result shows that the scandiction problem
is fundamentally different from, and more challenging than, the previously studied problems, such as
prediction and compression, where competition with an arbitrary finite set of schemes in the
individual sequence setting is well known to be an attainable goal. However, in Theorem 4 of
Section 3, we show that for the class of spatially stationary random fields, and subject to mild
conditions on the prediction loss function, one can compete with any finite set of scandictors
(under minimum expected loss). Furthermore, in Theorem 8, the main result of that section,
we introduce a universal scandictor for any spatially stationary random field. Section 3 also
includes almost-sure analogues of the above theorems for mixing random fields and basic
results on cases where universal scandiction of individual images is possible.

In Section 4, we derive upper bounds on the excess loss incurred when non-optimal scanners
are used, with optimal prediction schemes. Namely, we consider the scenario where one cannot
use a universal scandictor (or the optimal scan for a given random field), and instead uses an
arbitrary scanning order, accompanied by the optimal predictor for that scan. In a sense, the
results of Section 4 can be used to assess the sensitivity of the scandiction performance to the
scanning order. Furthermore, in Section 4 we also discuss the scenario where the Peano-Hilbert
scanning order is used, accompanied by an optimal predictor, and derive a bound on the excess
loss compared to optimal finite-state scandiction, which is valid for any individual image and
any bounded loss function. Section 5 includes a few concluding remarks and open problems.

In [27], the second part of this two-part paper, we consider sequential decision making 
for noisy data arrays. Namely, the decision maker observes a noisy version of the data, yet, 
it is judged with respect to the clean data. As the clean data is not available, two distinct 
cases are interesting to consider. The first, scanning and filtering, is when $Y_{\Psi_t}$ is available
in the estimation of $X_{\Psi_t}$, i.e., $F_t$ depends on $Y_{\Psi_1}$ to $Y_{\Psi_t}$, where $\{Y\}$ is the noisy data. The
second, noisy scandiction, is when the noisy observation at the current site is not available
to the decision maker. In both cases, the decision maker cannot evaluate its performance
precisely, as $l(x_{\Psi_t}, F_t)$ cannot be computed. Yet, many of the results for noisy scandiction
are extendable from the noiseless case, similarly to the way results for noisy prediction were extended
from results for noiseless prediction [28]. The scanning and filtering problem, however, poses
new challenges and requires the use of new tools and techniques. Thus, in [27], we formally
define the best achievable performance in these cases, derive bounds on the excess loss when
non-optimal scanners are used and present universal algorithms. Special emphasis is given
to the cases of binary random fields corrupted by a binary memoryless channel and real-valued
fields corrupted by Gaussian noise.

2 Problem Formulation 

The following notation will be used throughout this paper.¹ Let A denote the alphabet, which
is either discrete or the real line. Let $\Omega = A^{\mathbb{Z}^d}$ denote the space of all possible data arrays in
$\mathbb{Z}^d$. Although the results in this paper are applicable to any $d \ge 1$, for simplicity, we assume
from now on that d = 2. The extension to d > 2 is straightforward. A probability measure Q
on $\Omega$ is stationary if it is invariant under translations $\tau_i$, where for each $x \in \Omega$ and $i, j \in \mathbb{Z}^2$,
$\tau_i(x)_j = x_{j+i}$ (namely, stationarity means shift invariance). Denote by $\mathcal{M}(\Omega)$ and $\mathcal{M}_S(\Omega)$
the spaces of all probability measures and stationary probability measures on $\Omega$, respectively.
Elements of $\mathcal{M}(\Omega)$, random fields, will be denoted by upper case letters while elements of $\Omega$,
individual data arrays, will be denoted by the corresponding lower case.

Let $\mathcal{V}$ denote the set of all finite subsets of $\mathbb{Z}^2$. For $V \in \mathcal{V}$, denote by $X_V$ the restriction
of the data array X to V. For $i \in \mathbb{Z}^2$, $X_i$ is the random variable corresponding to X at site
i. Let $\mathcal{R}_\square$ be the set of all rectangles of the form $V = \mathbb{Z}^2 \cap ([m_1, m_2] \times [n_1, n_2])$. As a special
case, denote by $V_n$ the square $\{0, \ldots, n-1\} \times \{0, \ldots, n-1\}$. For $V \subset \mathbb{Z}^2$, let the interior
diameter of V be

$$R(V) = \sup\{r : \exists c \text{ s.t. } B(c, r) \subseteq V\}, \tag{1}$$

where $B(c, r)$ is a closed ball (under the $l_1$-norm) of radius r centered at c. Throughout, $\log(\cdot)$
will denote the natural logarithm, and entropies will be measured in nats.

Definition 1 ([22]). A scandictor for a finite set of sites $B \in \mathcal{V}$ is the following pair $(\Psi, F)$:

• $\{\Psi_t\}_{t=1}^{|B|}$ is a sequence of measurable mappings, $\Psi_t : A^{t-1} \to B$, determining the site to
be visited at time t, with the property that

$$\{\Psi_1, \Psi_2(x_{\Psi_1}), \Psi_3(x_{\Psi_1}, x_{\Psi_2}), \ldots, \Psi_{|B|}(x_{\Psi_1}, \ldots, x_{\Psi_{|B|-1}})\} = B, \quad \forall x \in A^B. \tag{2}$$

• $\{F_t\}_{t=1}^{|B|}$ is a sequence of measurable predictors, where for each t, $F_t : A^{t-1} \to D$ determines
the prediction for the site visited at time t based on the observations at previously
visited sites, and D is the prediction alphabet.

¹For easy reference, we try to follow the notation of [22] whenever possible.

Figure 1: A graphical representation of the scandiction process. A scandictor $(\Psi, F)$ first chooses
an initial site $\Psi_1$. It then gives its prediction for the value at that site, $F_1$. After observing the
true value at $\Psi_1$, it suffers a loss $l(x_{\Psi_1}, F_1)$, chooses the next site to be visited, $\Psi_2(x_{\Psi_1})$, gives its
prediction for the value at that site, $F_2(x_{\Psi_1})$, and so on.

We allow randomized scandictors, namely, scandictors such that $\{\Psi_t\}_{t=1}^{|B|}$ or $\{F_t\}_{t=1}^{|B|}$ can be
chosen randomly from some set of possible functions. At this point, it is important to note
that scandictors for infinite data arrays are not considered in this paper. Definition 1 and the
results to follow consider only scandictors for finite sets of sites, ones which can be viewed
merely as a reordering of the sites in a finite set B. We will consider, though, the limit as the
size of the array tends to infinity. A scandictor such that there exists a finite set of sites B
for which there is no deterministic finite point in time by which all sites in B are scanned is
not included in the scope of Definition 1. Figure 1 includes a graphical representation of the
scandiction process.
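
The scandiction process of Definition 1 can be sketched in a few lines of Python: at each step the scanner picks an unvisited site from the past observations only, the predictor guesses its value, and the per-step losses are summed. This is a minimal sketch under stated assumptions; the names `cumulative_loss`, `raster_scanner` and `last_value_predictor` are hypothetical, and the visited sites are passed to the predictor for convenience (they are themselves a function of the past observations).

```python
def cumulative_loss(x, scanner, predictor, loss):
    """Run a scandictor over a finite array x (dict: site -> value) and sum its losses."""
    visited, observed = [], []
    total = 0.0
    for _ in range(len(x)):
        site = scanner(visited, observed)      # Psi_t: depends only on the past
        if site in visited or site not in x:
            raise ValueError("scanner must visit each site exactly once")
        guess = predictor(visited, observed)   # F_t: prediction for the chosen site
        total += loss(x[site], guess)          # loss is suffered after the value is revealed
        visited.append(site)
        observed.append(x[site])
    return total


def raster_scanner(n):
    """A data-independent scanner: row by row, left to right."""
    order = [(i, j) for i in range(n) for j in range(n)]
    return lambda visited, observed: order[len(visited)]


def last_value_predictor(visited, observed):
    """Predict the most recently observed value (0.5 before anything is seen)."""
    return observed[-1] if observed else 0.5


def squared_loss(value, guess):
    return (value - guess) ** 2


x = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}
total = cumulative_loss(x, raster_scanner(2), last_value_predictor, squared_loss)
```

A data-dependent scanner fits the same interface: it simply inspects `observed` before returning the next site.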

Denote by $L_{(\Psi,F)}(x_{V_n})$ the cumulative loss of $(\Psi, F)$ over $x_{V_n}$, that is,

$$L_{(\Psi,F)}(x_{V_n}) = \sum_{t=1}^{|V_n|} l\big(x_{\Psi_t}, F_t(x_{\Psi_1}, \ldots, x_{\Psi_{t-1}})\big), \tag{3}$$
where $l : A \times D \to [0, \infty)$ is a given loss function. Throughout this paper, we assume that
$l(\cdot,\cdot)$ is non-negative and bounded by $l_{\max} < \infty$. The scandictability of a source $Q \in \mathcal{M}(\Omega)$
on $B \in \mathcal{V}$ is defined by

$$U(l, Q_B) = \inf_{(\Psi,F) \in \mathcal{S}(B)} E_{Q_B} \frac{1}{|B|} L_{(\Psi,F)}(X_B), \tag{4}$$

where $Q_B$ is the marginal probability measure of X restricted to B and $\mathcal{S}(B)$ is the set of all
possible scandictors for B. The scandictability of $Q \in \mathcal{M}_S(\Omega)$ is defined by

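$$U(l, Q) = \lim_{n \to \infty} U(l, Q_{V_n}). \tag{5}$$

By [22, Theorem 1], the limit in (5) exists for any $Q \in \mathcal{M}_S(\Omega)$ and, in fact, for any sequence
$\{B_n\}$ of elements of $\mathcal{R}_\square$ for which $R(B_n) \to \infty$ we have

$$U(l, Q) = \lim_{n \to \infty} U(l, Q_{B_n}) = \inf_{B \in \mathcal{R}_\square} U(l, Q_B). \tag{6}$$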

2.1 Finite-Set Scandictability

It will be constructive to refer to the finite set scandictability as well. Let $\mathcal{F} = \{\mathcal{F}_n\}$ be a
sequence of finite sets of scandictors, where for each n, $|\mathcal{F}_n| = \lambda < \infty$, and the scandictors
in $\mathcal{F}_n$ are defined for the finite set of sites $V_n$. A possible scenario is one in which one has
a set of "scandiction rules", each of which defines a unique scanner for each n, but all these
scanners comply with the same rule. In this case, $\mathcal{F} = \{\mathcal{F}_n\}$ can also be viewed as a finite set
$\mathcal{F}$ which includes sequences of scandictors. For example, $|\mathcal{F}_n| = 2$ for all n, where for each
n, $\mathcal{F}_n$ includes one scandictor which scans the data row-wise and one which scans the data
column-wise. We may also consider cases in which $|\mathcal{F}_n|$ increases with n (but remains finite
for every finite n). For $Q \in \mathcal{M}_S(\Omega)$ and $\mathcal{F} = \{\mathcal{F}_n\}$, we thus define the finite set scandictability
of Q as the limit

$$U_{\mathcal{F}}(l, Q) = \lim_{n \to \infty} \min_{(\Psi,F) \in \mathcal{F}_n} E_{Q_{V_n}} \frac{1}{|V_n|} L_{(\Psi,F)}(X_{V_n}), \tag{7}$$

if it exists. Observe that the sub-additivity property of the scandictability as defined in [22],
which was fundamental for the existence of the limit in (5), does not carry over verbatim to
finite set scandictability. This is for the following reason. Suppose $(\Psi, F) \in \mathcal{S}$ is the optimal
scandictor for $X_V$ and $(\Psi', F') \in \mathcal{S}$ is optimal for $X_U$ (assume $V \cap U = \emptyset$). When scanning
$X_{V \cup U}$, one may not be able to apply $(\Psi, F)$ for $X_V$ and then $(\Psi', F')$ for $X_U$, as this scandictor
might not be in $\mathcal{S}$. Hence, we seek a universal scheme which competes successfully (in a sense
soon to be defined) with a sequence of finite sets of scandictors even when the limit in
(7) does not exist.
3 Universal Scandiction



The problem of universal prediction is well studied, with various solutions in both the stochastic
setting and the individual setting. In this section, we study the problem of universal scandiction.
Although it is strongly related to its prediction analogue, we first show that this problem
is fundamentally different in several aspects, mainly due to the enormous degree of freedom
in choosing the scanning order. In particular, we first give a negative result, stating that while
in the prediction problem it is possible to compete with any finite number of predictors
on every individual sequence, in the scandiction problem one cannot even compete with any
two scandictors on a given individual data array. Nevertheless, we show that in the setting
of stationary random fields, and under the minimum expected loss criterion, it is possible to
compete with any finite set of scandictors. We then show that the set of finite-state scandictors
is capable of achieving the scandictability of any spatially stationary source. In Theorem 8,
our main result in this section, we give a universal algorithm which achieves the scandictability
of any spatially stationary source.

3.1 A Negative Result on Scandiction 

Assume both the alphabet A and the prediction space D are [0, 1]. Let l be any non-degenerated
loss function, in the sense that prediction of a Bernoulli sequence under it results in a positive
expected loss. As an example, squared or absolute error can be kept in mind, though the
result below applies to many other loss functions. The following theorem asserts that in the
individual image scenario, it is not possible to compete successfully with any two arbitrary
scandictors (it is possible, though, to compete with some scandictor sets, as shown later in this section).

Theorem 2. Let A = D = [0, 1] and assume l is a non-degenerated loss function. There exist
two scandictors $(\Psi, F)_1$ and $(\Psi, F)_2$ for $V_n$, such that for any scandictor $(\Psi, F)$ for $V_n$ there
exists $x_{V_n}$ for which

$$L_{(\Psi,F)}(x_{V_n}) - \min\{L_{(\Psi,F)_1}(x_{V_n}), L_{(\Psi,F)_2}(x_{V_n})\} = \Theta(|V_n|). \tag{8}$$

In words, there exist two scandictors such that for any third scandictor, there exists an 
individual image for which the redundancy when competing with the two scandictors does not 
vanish. Theorem 2 marks a fundamental difference between the case where reordering of the
data is allowed, e.g., scanning of multidimensional data or even reordering of one-dimensional
data, and the case where there is one natural order for the data. For example, using the
exponential weighting algorithm discussed earlier, it is easy to show that in the prediction 
problem (i.e., with no scanning), it is possible to compete with any finite set of predictors 
under the above alphabets and loss functions. Thus, although the scandiction problem is 
strongly related to its prediction analogue, the numerous scanning possibilities result in a 
substantially richer and more challenging problem. 
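
For the prediction problem (fixed order, no scanning), the exponential weighting idea discussed above can be sketched as follows. This is a minimal sketch of the weighted-average variant for convex losses such as squared error, not necessarily the exact scheme of [26]; the function name `exponential_weighting`, the learning rate `eta` and the two constant experts are assumptions made for illustration.

```python
import math


def exponential_weighting(sequence, experts, eta, loss):
    """Predict `sequence` sequentially by mixing the advice of `experts`.

    Each expert is a function mapping the observed past to a prediction in [0, 1].
    Returns the forecaster's cumulative loss and each expert's cumulative loss.
    """
    cum = [0.0] * len(experts)
    total = 0.0
    for t, x in enumerate(sequence):
        past = sequence[:t]
        advice = [e(past) for e in experts]
        # Experts with a larger past loss receive exponentially smaller weight.
        weights = [math.exp(-eta * c) for c in cum]
        z = sum(weights)
        guess = sum(w * a for w, a in zip(weights, advice)) / z
        total += loss(x, guess)
        for k, a in enumerate(advice):
            cum[k] += loss(x, a)
    return total, cum


n = 100
sequence = [1.0] * n
experts = [lambda past: 0.0, lambda past: 1.0]  # two constant predictors
eta = math.sqrt(8 * math.log(len(experts)) / n)
total, cum = exponential_weighting(sequence, experts, eta, lambda x, g: (x - g) ** 2)
```

The scheme works here because all experts share the same past at every time instant. This is exactly what fails in scandiction, where each scandictor may have visited a different set of sites.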

Theorem 2 is a direct application of the lemma below.

Lemma 3. Let A = D = [0, 1] and assume l is a non-degenerated loss function. There exist a
random field $X_{V_n}$ and two scandictors $(\Psi, F)_1$ and $(\Psi, F)_2$ for $V_n$, such that for any scandictor
$(\Psi, F)$ for $V_n$,

$$E L_{(\Psi,F)}(X_{V_n}) - E \min\{L_{(\Psi,F)_1}(X_{V_n}), L_{(\Psi,F)_2}(X_{V_n})\} = \Theta(|V_n|). \tag{9}$$

Lemma 3 gives another perspective on the difference between the scandiction and prediction
scenarios. The lemma asserts that when reordering of the data is allowed, one cannot achieve
a vanishing redundancy with respect to the expected value of the minimum among a set of
scandictors. This should be compared to the prediction scenario (no reordering), where one
can compete successfully not only with respect to the minimum of the expected losses of all
the predictors, but also with respect to the expected value of the minimum (for example, see
[29, Corollary 1]). The main result of this section, however, is that for any stationary random
field and under mild conditions on the loss function, one can compete successfully with any 
finite set of scandictors when the performance criterion is the minimum expected loss. 

Proof. (Lemma 3) Let $Y_{V_n}$ be a random field such that $Y(1,1)$ is distributed uniformly on
[0, 1], and $Y_{V_n} \setminus Y(1,1) = \{Y(1,2), \ldots, Y(1,n), Y(2,1), Y(2,2), \ldots, Y(n,n)\}$ are simply the first
$n^2 - 1$ bits in the binary representation of $Y(1,1)$ (ordered row-wise). Note that $Y_{V_n} \setminus Y(1,1)$
are i.i.d. unbiased bits, yet conditioned on $Y(1,1)$, they are deterministic and known. Assume
now that $X_{V_n}$ is a random cyclic shift of $Y_{V_n}$, in the same row-wise order in which $Y_{V_n}$ was created.

For concreteness, we assume the squared error loss function. In this case, it is easy to
identify the constant of the $\Theta(\cdot)$ expression in (8). However, the computations below are
easily generalized to other non-degenerated loss functions. We first show that the expected
cumulative squared error of any scandictor on $X_{V_n}$ is at least $(n^2 + 1)/8$, as the expected
number of steps until the real-valued site is located is $(n^2 + 1)/2$, with a loss of 1/4 until that
time. More specifically, let J be the random number of cyclic shifts, that is, J is uniformly
distributed on $\{0, 1, \ldots, n^2 - 1\}$. For fixed j, let G be the random index such that $\Psi_G$ is the
real-valued site (i.e., G is the time the real-valued random variable is located by the scanner).
Let $\phi_s$ denote the Bayes envelope associated with the squared error loss, i.e.,

$$\phi_s(p) = \min_{q \in [0,1]} \left[(1 - p) q^2 + p (q - 1)^2\right]. \tag{10}$$

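For any scandictor $(\Psi, F)$, we have

$$
\begin{aligned}
E L_{(\Psi,F)}(X_{V_n}) &= E_J E\left\{ \sum_{i=1}^{G} \left[X_{\Psi_i} - F_i(X_{\Psi_1}, \ldots, X_{\Psi_{i-1}})\right]^2 + \sum_{i=G+1}^{n^2} \left[X_{\Psi_i} - F_i(X_{\Psi_1}, \ldots, X_{\Psi_{i-1}})\right]^2 \right\} \\
&\ge E_J E\left\{ \sum_{i=1}^{G} \left[X_{\Psi_i} - F_i(X_{\Psi_1}, \ldots, X_{\Psi_{i-1}})\right]^2 \right\} \\
&\ge E_J E\left\{ \sum_{i=1}^{G} \phi_s\big(P(X_{\Psi_i} = 1 \mid X_{\Psi_1}, \ldots, X_{\Psi_{i-1}})\big) \right\} \\
&= E_J E\left\{ G\, \phi_s\big(P(X_{\Psi_1} = 1)\big) \right\} \\
&= \frac{1}{8}\left(n^2 + 1\right).
\end{aligned}
\tag{11}
$$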

On the other hand, consider the expected minimum of the losses of the following two scandictors:
$(\Psi, F)_1$, which scandicts $X_{V_n}$ row-wise from X(1,1) to X(n,n), and $(\Psi, F)_2$, which
scandicts $X_{V_n}$ row-wise from X(n,n) to X(1,1). Using the same method as in (11), it is
possible to show that this expected loss is smaller than $n^2/16 + o(n^2)$, as the expected number
of steps until the first of the two locates the real-valued site is $(n^2 + 1)^2/(4n^2)$, after which zero loss
is incurred. This is since once the real-valued site is located, the rest of the values can be
calculated by the predictor by cyclic shifting the binary representation of the real-valued pixel.
This completes the proof. □
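
The two step counts used in the proof can be checked with exact arithmetic. The sketch below assumes a fixed scanning order and a uniformly placed real-valued site among $N = n^2$ sites, and charges a loss of 1/4 per step until the site is found; the function names are hypothetical, and the closed form for the two-ended case is checked only for odd N.

```python
from fractions import Fraction


def expected_first_hit(N):
    """Expected number of steps until a fixed scan reaches a uniformly placed site among N."""
    return Fraction(sum(range(1, N + 1)), N)


def expected_first_hit_two_ends(N):
    """Same quantity when the faster of a forward and a backward scan is counted."""
    return Fraction(sum(min(k, N + 1 - k) for k in range(1, N + 1)), N)


n = 5
N = n * n
single = expected_first_hit(N) / 4          # one scandictor: (N + 1) / 8
double = expected_first_hit_two_ends(N) / 4  # best of two: about N / 16
gap = single - double                        # grows linearly in N
```

For n = 5 the two values are 13/4 and 169/100, so the gap is already larger than N/16 = 25/16 minus lower-order terms, in line with the linear redundancy claimed in Lemma 3.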

Proof. (Theorem 2) By Lemma 3, there exists a stochastic setting under which the expected
minimum of the losses of two scandictors is smaller than the expected loss of any single
scandictor. Thus, for any scandictor there exists an individual image on which it cannot
compete successfully with the two scandictors. □
3.2 Universal Scandiction With Respect to Arbitrary Finite Sets



As mentioned in Section 1, straightforward implementation of the exponential weighting algorithm
is not feasible, since one may not be able to alternate from one expert to the other at
will. However, the exponential weighting algorithm was found useful in several lossy source
coding works such as Linder and Lugosi [30], Weissman and Merhav [31], Gyorgy et al. [32]
and the derivation of sequential strategies for loss functions with memory [33], all of which
confronted a similar problem. A common method used in these works is the alternation of
experts only once every block of input symbols, necessary to bear the price of this change (e.g.,
transmitting the description of the chosen quantizer [30]-[32]). Thus, although the difficulties
in these examples differ from those we confront here, the solution suggested therein, which is
to persist in using the same expert for a significantly long block of data before alternating it,
was found useful in our universal scanning problem.

Particularly, we divide the data array into smaller blocks and alternate scandictors only 
each time a new block of data is to be scanned. Unlike the case of sequential prediction dealt 
with in [33], here the scandictors must be restarted each time a new block is scanned, as it is 
not at all guaranteed that all the scandictors scan the data in the same (or any) block-wise
order (i.e., it is not guaranteed that a scandictor for $V_n$ divides the array into sub-blocks of size
m × m and scans each of them separately). Hence, in order to prove that it is possible to
compete with the best scandictor at each stage n, we go through two phases. In the first, 
we prove that an exponential weighting algorithm may be used to compete with the best 
scandictor among those operating in a block-wise order. This part of the proof will refer to 
any given data array (deterministic scenario). In the second phase, we use the stationarity of 
the random field to prove that a block-wise scandictor may perform essentially as well as one 
scanning the data array as a whole. The following theorem stands at the basis of our results, 
establishing the existence of a universal scandictor which competes successfully with any finite 
set of scandictors. 

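Theorem 4. Let $X$ be a stationary random field with a probability measure $Q$. Let $\mathcal{F} = \{\mathcal{F}_n\}$ be an arbitrary sequence of scandictor sets, where $\mathcal{F}_n$ is a set of scandictors for $V_n$ and $|\mathcal{F}_n| = \lambda < \infty$ for all $n$. Then, there exists a sequence of scandictors $\{(\hat\Psi,\hat F)_n\}$, where $(\hat\Psi,\hat F)_n$ is a scandictor for $V_n$, independent of $Q$, for which 

$$\liminf_{n\to\infty} E_{Q_{V_n}} E \frac{1}{|V_n|} L_{(\hat\Psi,\hat F)_n}(X_{V_n}) \le \liminf_{n\to\infty} \min_{(\Psi,F)\in\mathcal{F}_n} E_{Q_{V_n}} \frac{1}{|V_n|} L_{(\Psi,F)}(X_{V_n}) \qquad (12)$$

for any $Q \in \mathcal{M}_S(\Omega)$, where the inner expectation in the l.h.s. of (12) is due to the possible randomization in $(\hat\Psi,\hat F)_n$. 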

Before we prove Theorem 4, let us discuss an "individual image" type of result, which will later be the basis of the proof. Let $x_{V_n}$ denote an individual $n \times n$ data array. For $m < n$, define $K = \lceil n/m \rceil - 1$. We divide $x_{V_n}$ into $K^2$ blocks of size $m \times m$ and $2K + 1$ blocks of possibly smaller size. Denote by $x^i$, $0 \le i \le (K + 1)^2 - 1$, the $i$'th block under some fixed scanning order of the blocks. Since we will later see that this scanning order is irrelevant in this case, assume from now on that it is a (continuous) raster scan from the upper left corner. That is, the first line of blocks is scanned left to right, the second line is scanned right to left, and so on. We will refer to this scan simply as "raster scan". 
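
The block partition and the block-level raster scan described above can be sketched as follows. This is only an illustration: the function name `serpentine_blocks` and the `(top_row, left_col, height, width)` tuple layout are assumptions made for this example and do not appear in the original text.

```python
import math

def serpentine_blocks(n, m):
    """Partition an n x n array into blocks of side at most m and list them in a
    continuous raster order: first block-row left to right, second right to left, ...

    Returns a list of (top_row, left_col, height, width) tuples.
    """
    per_side = math.ceil(n / m)          # equals K + 1 in the notation of the text
    blocks = []
    for br in range(per_side):
        cols = range(per_side)
        if br % 2 == 1:                  # odd block-rows are scanned right to left
            cols = reversed(cols)
        for bc in cols:
            top, left = br * m, bc * m
            height = min(m, n - top)     # blocks on the lower edge may be shorter
            width = min(m, n - left)     # blocks on the right edge may be narrower
            blocks.append((top, left, height, width))
    return blocks

blocks = serpentine_blocks(10, 4)        # here K = ceil(10/4) - 1 = 2
full = [b for b in blocks if b[2] == 4 and b[3] == 4]
print(len(blocks), len(full))
```

For $n = 10$ and $m = 4$ the sketch yields $(K+1)^2 = 9$ blocks, of which $K^2 = 4$ are full $4 \times 4$ blocks and $2K + 1 = 5$ are smaller.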

As mentioned, the suggested algorithm scans the data in $x_{V_n}$ block-wise, that is, it does not apply any of the scandictors in $\mathcal{F}_n$, only scandictors from $\mathcal{F}_m$. Omitting $m$ for convenience, denote by $L_{j,i}$ the cumulative loss of $(\Psi,F)_j \in \mathcal{F}_m$ after scanning $i$ blocks, where $(\Psi,F)_j$ is restarted after each block, namely, it scans each block separately and independently of the other blocks. Note that $L_{j,i} = \sum_{k=0}^{i-1} L_j(x^k)$ and that for $i = 0$, $L_{j,i} = 0$ for all $j$. Since we assumed the scandictors are capable of scanning only square blocks, for the $2K + 1$ possibly smaller (and not square) blocks the loss may be $l_{\max}$ throughout. For $\eta > 0$, and any $i$ and $j$, define 

$$P_i\left(j \,\middle|\, \{L_{j,i}\}_{j=1}^{\lambda}\right) = \frac{e^{-\eta L_{j,i}}}{\sum_{j'=1}^{\lambda} e^{-\eta L_{j',i}}}, \qquad (13)$$

where $\lambda = |\mathcal{F}_m|$. We offer the following algorithm for a block-wise scan of the data array $x$. For each $0 \le i \le (K + 1)^2 - 1$, after scanning $i$ blocks of data, the algorithm computes $P_i(j \mid \{L_{j,i}\}_{j=1}^{\lambda})$ for each $j$. It then randomly selects a scandictor according to this distribution, independently of its previous selections, and uses this scandictor as its output for the $(i + 1)$-st block. Namely, the universal scandictor $(\hat\Psi,\hat F)_n$, promised by Theorem 4, is the one which divides the data into blocks, performs a raster scan of the data block-wise, and uses the above algorithm to decide which scandictor out of $\mathcal{F}_m$ to use for each block. 
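
The block-wise selection rule can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the per-block losses of each restarted scandictor are assumed to be given as a table `block_losses[i][j]`, and the names `exp_weights` and `blockwise_universal_loss` are hypothetical.

```python
import math
import random

def exp_weights(cum_losses, eta):
    """Exponential weighting: expert j gets probability proportional to exp(-eta * L_j)."""
    base = min(cum_losses)               # subtracting the minimum avoids underflow
    w = [math.exp(-eta * (L - base)) for L in cum_losses]
    total = sum(w)
    return [x / total for x in w]

def blockwise_universal_loss(block_losses, eta, rng):
    """block_losses[i][j] is the loss expert j incurs on block i when it is
    restarted at the beginning of that block. Returns the loss incurred by
    the randomized block-wise selection.
    """
    n_experts = len(block_losses[0])
    cum = [0.0] * n_experts              # cumulative losses start at zero
    incurred = 0.0
    for losses in block_losses:
        probs = exp_weights(cum, eta)
        j = rng.choices(range(n_experts), weights=probs)[0]
        incurred += losses[j]
        # Every expert is charged: once the block is observed, the loss each
        # expert would have incurred on it can be computed.
        cum = [c + l for c, l in zip(cum, losses)]
    return incurred

rng = random.Random(0)
# Two hypothetical experts over 200 blocks; the second is consistently better.
toy_losses = [[1.0, 0.2] for _ in range(200)]
total = blockwise_universal_loss(toy_losses, eta=0.5, rng=rng)
print(total, "vs. best expert:", 0.2 * 200)
```

The selection for a block uses only losses accumulated on earlier blocks, which is what allows the scheme to run sequentially.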

It is clear that both the block size and the number of blocks should tend to infinity with $n$ in order to achieve meaningful results. Thus, we require the following: (a) $m = m(n)$ tends to infinity, but strictly slower than $n$, i.e., $m(n) = o(n)$; (b) $m(n)$ is an integer-valued monotonically increasing function, such that for each $K \in \mathbb{Z}$ there exists $n$ such that $m(n) = K$. The results are summarized in the following two propositions, the first of which asserts that for $m(n) = o(n)$, vanishing redundancy is indeed possible, while the second asserts that under slightly stronger requirements on $m(n)$, this is also true in the a.s. sense (with respect to the random selection of the scandictors in the algorithm). 

Proposition 5. Let $L_{alg}(x_{V_n})$ be the cumulative loss of the proposed algorithm on $x_{V_n}$, and denote by $\bar L_{alg}(x_{V_n})$ its expected value, where the expectation is with respect to the randomized scandictor selection of the algorithm. Let $L_{\min}$ denote the cumulative loss of the best scandictor in $\mathcal{F}_m$, operating block-wise on $x_{V_n}$. Assume $|\mathcal{F}_m| = \lambda$. Then, for any $x_{V_n}$, 

$$\bar L_{alg}(x_{V_n}) - L_{\min}(x_{V_n}) \le m(n)\big(n + m(n)\big)\sqrt{\log\lambda}\,\frac{l_{\max}}{\sqrt{2}}. \qquad (14)$$

Proposition 6. Assume m{n) = o(n^/'^). Then, for any image xy^, the difference between 
the normalized cumulative loss of the proposed algorithm and that of the best scandictor in $\mathcal{F}_m$, 
operating block-wise, converges to 0 with probability 1 with respect to the randomized scandictor 
selection of the algorithm. 

The proofs of Propositions 5 and 6 are rather technical and are based on the very same methods used in [31] and [33]. See Appendices A.1 and A.2 for the details. 

On the more technical side, note that the suggested algorithm has a "finite horizon," that is, one has to know the size of the image in order to divide it into blocks, and only then can the exponential weighting algorithm be used. It is possible to extend the algorithm to an infinite horizon. The essence of this generalization is in dividing the infinite image into blocks of exponentially growing sizes, and applying the finite-horizon algorithm to each block. We may now proceed to the proof of Theorem 4. 

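Proof of Theorem 4. Since the result of Proposition 5 applies to any individual data array, it certainly applies after taking the expectation with respect to $Q$. Therefore, 

$$E_{Q_{V_n}} \frac{1}{n^2}\bar L_{alg}(X_{V_n}) - E_{Q_{V_n}} \frac{1}{n^2} L_{\min}(X_{V_n}) \le \frac{m(n)}{n}\, l_{\max}\sqrt{2\log\lambda}. \qquad (15)$$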

However, remember that we are not interested in competing with $E_{Q_{V_n}}\frac{1}{n^2}L_{\min}$, as this is the 
performance of the best block-wise scandictor. We wish to compete with the best scandictor 



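^For example, take four blocks of size $l \times l$, then three of size $2l \times 2l$, and so on. 

operating on the entire data array $X_{V_n}$, that is, on the whole image of size $n \times n$. We have 

$$
\begin{aligned}
E_{Q_{V_n}} \frac{1}{n^2} L_{\min}(X_{V_n}) &= E_{Q_{V_n}} \frac{1}{n^2} \min_{(\Psi,F)_j \in \mathcal{F}_{m(n)}} \sum_{i=0}^{(K+1)^2-1} L_j(X^i) \\
&\le \min_{(\Psi,F)_j \in \mathcal{F}_{m(n)}} E_{Q_{V_n}} \frac{1}{n^2} \sum_{i=0}^{(K+1)^2-1} L_j(X^i) \\
&\overset{(a)}{\le} \min_{(\Psi,F)_j \in \mathcal{F}_{m(n)}} \frac{1}{n^2} \left[ K^2 E_{Q_{V_{m(n)}}} L_j(X^0) + \big(n^2 - K^2 m(n)^2\big)\, l_{\max} \right] \\
&\le \min_{(\Psi,F)_j \in \mathcal{F}_{m(n)}} E_{Q_{V_{m(n)}}} \frac{1}{m(n)^2} L_j(X^0) + 2\frac{m(n)}{n}\, l_{\max}, \qquad (16)
\end{aligned}
$$

where (a) follows from the stationarity of $Q$, the assumption that each $(\Psi,F)_j$ operates in the same manner on each $m(n) \times m(n)$ block, no matter what its coordinates are, and the fact that each $(\Psi,F)_j$ may incur maximal loss on non-square rectangles. From (15) and (16), we have 

$$
\begin{aligned}
E_{Q_{V_n}} \frac{1}{n^2} \bar L_{alg} &\le E_{Q_{V_{m(n)}}} \frac{1}{m(n)^2} L_{j^*(m(n))}(X^0) + 2\frac{m(n)}{n}\, l_{\max} + \frac{m(n)}{n}\, l_{\max}\sqrt{2\log\lambda} \\
&= E_{Q_{V_{m(n)}}} \frac{1}{m(n)^2} L_{j^*(m(n))}(X^0) + O\!\left(\frac{m(n)}{n}\sqrt{\log\lambda}\right), \qquad (17)
\end{aligned}
$$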

where $(\Psi,F)_{j^*(m(n))}$ is the scandictor achieving the minimum in (16). Finally, by our assumptions 
on $\{m(n)\}$, we have 



< inf ^Eq,,^ _l^..(,)(Xv.J| + -^lmaxi2 + V21ogA). (18) 

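Taking the limit as $n \to \infty$ and using the fact that $m(k)/k \to 0$ together with the arbitrariness of $k$, gives: 

$$\liminf_{n\to\infty} E_{Q_{V_n}} \frac{1}{n^2} \bar L_{alg} \le \liminf_{n\to\infty} E_{Q_{V_n}} \frac{1}{n^2} L_{j^*(n)}(X_{V_n}), \qquad (19)$$

which completes the proof of (12). □ 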

It is evident from (14) and (15) that although the results of Theorem 4 and Proposition 5 are formulated for fixed $\lambda < \infty$ (the cardinality of the scandictor set), these results hold for the more general case of $\lambda = \lambda(n)$, as long as the redundancy vanishes, i.e., as long as $m(n) = o(n)$ and $\lambda(n)$ is such that $\frac{m(n)}{n}\sqrt{\log\lambda(n)} \to 0$ when $n \to \infty$. The requirement that $\lambda(n) = o\big(e^{n^2/m(n)^2}\big)$ still allows very large scandictor sets, especially when $m(n)$ grows slowly with $n$. Furthermore, it is evident from equation (17) that whenever the redundancy vanishes, the statement of Theorem 4 is valid with limsup as well, i.e., 

$$\limsup_{n\to\infty} E_{Q_{V_n}} E \frac{1}{|V_n|} L_{(\hat\Psi,\hat F)_n}(X_{V_n}) \le \limsup_{n\to\infty} \min_{(\Psi,F)\in\mathcal{F}_n} E_{Q_{V_n}} \frac{1}{|V_n|} L_{(\Psi,F)}(X_{V_n}). \qquad (20)$$

3.3 Finite-State Scandiction 

Consider now the set of finite-state scandictors, very similar to the set of finite-state encoders described in [H]. At time $t = 1$, a finite-state scandictor starts at an arbitrary initial site $\Psi_1$, with an arbitrary initial state $s_0 \in S$, and gives $F(s_0)$ as its prediction for $x_{\Psi_1}$. Only then does it observe $x_{\Psi_1}$. After observing $x_{\Psi_i}$, it computes its next state, $s_i$, according to $s_i = g(s_{i-1}, x_{\Psi_i})$ and advances to the next site, $x_{\Psi_{i+1}}$, according to $\Psi_{i+1} = \Psi_i + d(s_i)$, where $g : S \times \mathcal{A} \to S$ is the next-state function and $d : S \to B$ is the displacement function, $B \subset \mathbb{Z}^2$ denoting a fixed finite set of possible relative displacements. It then gives its prediction $F(s_i)$ for the value $x_{\Psi_{i+1}}$. Similarly to [H], we assume the alphabet $\mathcal{A}$ includes an additional "End of File" (EoF) symbol to mark the image edges. The following lemma and the theorem which follows establish the fact that the set of finite-state scandictors is indeed rich enough to achieve the scandictability of any stationary source, yet not too rich to compete with. 

Lemma 7. Let $\mathcal{F}_S = \{(\Psi,F)_j\}$ be the set of all finite-state scandictors with at most $S$ states. Then, for any $Q \in \mathcal{M}_S(\Omega)$, 

$$\lim_{S\to\infty} U_{\mathcal{F}_S}(l,Q) = U(l,Q). \qquad (21)$$

That is, the scandictability of any spatially stationary source is asymptotically achieved with finite-state scandictors. 

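Proof. Take $B = V_m$ and let $(\Psi,F)_m$ be the achiever of the infimum in the definition of $U(l,Q_{V_m})$. That is, 

$$E_{Q_{V_m}} \frac{1}{|V_m|} L_{(\Psi,F)_m}(X_{V_m}) = \inf_{(\Psi,F)} E_{Q_{V_m}} \frac{1}{|V_m|} L_{(\Psi,F)}(X_{V_m}) = U(l,Q_{V_m}). \qquad (22)$$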

Since $V_m$ is a rectangle of size $m \times m$, the scandictor $(\Psi,F)_m$ is certainly implementable with a finite-state machine having $S(m) < \infty$ states. In other words, since $V_m$ is finite, any scanning rule $\Psi_t : \mathcal{A}^{t-1} \to B$ and any prediction rule $F_t : \mathcal{A}^{t-1} \to D$ can be implemented with a finite-state machine having at most $S(m) = |\mathcal{A}|^{m^2} \times m^2$ states, where in a straightforward implementation $|\mathcal{A}|^{m^2}$ states are required to account for all possible inputs and $m^2$ states are required to implement a counter. 

Now, for an $n \times n$ image (assuming now that $m$ divides $n$, as dealing with the general case can be done in the exact same way as in (16)), we take $(\Psi',F')_n$ to be the scandictor which scans the image in the block-by-block raster scan described earlier, applying $(\Psi,F)_m$ to each $m \times m$ block. Namely, $\Psi'$ scans all the blocks in the first $m$ lines left-to-right until it reaches an EoF symbol, then moves down $m$ lines, scans all the blocks right-to-left until an EoF is reached, and so on. The predictor $F'$ simply implements $F$ for each block separately, i.e., it resets to its initial values at the beginning of each block. It is clear that the scanner $\Psi'$ is implementable with a finite-state machine having $\tilde S(m) = S(m) + 2 < \infty$ states and thus $(\Psi',F') \in \mathcal{F}_{\tilde S(m)}$. 

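From the stationarity of $Q$, we have 

$$E_{Q_{V_n}} \frac{1}{|V_n|} L_{(\Psi',F')_n}(X_{V_n}) = E_{Q_{V_m}} \frac{1}{|V_m|} L_{(\Psi,F)_m}(X_{V_m}) = U(l,Q_{V_m}). \qquad (23)$$

Taking the limits $\limsup_{n\to\infty}$ and $\liminf_{n\to\infty}$, by (16), we have 

$$U(l,Q) \le \limsup_{n\to\infty} \min_{(\Psi,F)\in\mathcal{F}_{\tilde S(m)}} E_{Q_{V_n}} \frac{1}{|V_n|} L_{(\Psi,F)}(X_{V_n}) \le U(l,Q_{V_m}) \qquad (24)$$

and 

$$U(l,Q) \le \liminf_{n\to\infty} \min_{(\Psi,F)\in\mathcal{F}_{\tilde S(m)}} E_{Q_{V_n}} \frac{1}{|V_n|} L_{(\Psi,F)}(X_{V_n}) \le U(l,Q_{V_m}). \qquad (25)$$

The proof is completed (including the existence of the limit in the l.h.s. of (21)) by taking $m$ to infinity, using $U(l,Q_{V_m}) \to U(l,Q)$, and remembering that $U_{\mathcal{F}_S}(l,Q)$ is monotone in $S$, thus the convergence of the sub-sequence $\{U_{\mathcal{F}_{\tilde S(m)}}(l,Q)\}_{m=1}^{\infty}$ implies the convergence of the sequence $\{U_{\mathcal{F}_S}(l,Q)\}_{S=1}^{\infty}$. □ 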

In words, Lemma 7 asserts that for any $m$, finite-state machines attain the $m \times m$ Bayesian scandictability for any stationary random field. Note that the reason such results are accomplishable with FSMs is their ability to scan the entire data, block by block, with a machine having no more than $\tilde S(m)$ states, regardless of the size of the complete data array. The number of states depends only on the block size. 
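
The operation of a finite-state scandictor, as defined at the beginning of this subsection, can be simulated with a short loop. The sketch below is illustrative only: the function name, the `EOF` marker, the convention that off-image sites incur no loss, and the `max_steps` stopping rule are all assumptions made for this example, and the sketch does not check that sites are visited only once.

```python
EOF = "EoF"  # hypothetical marker observed when the machine steps off the image

def run_finite_state_scandictor(image, start, s0, g, d, F, loss, max_steps):
    """Simulate a finite-state scandictor.

    g(state, symbol) -> next state, d(state) -> (d_row, d_col), F(state) -> prediction.
    Returns (cumulative loss, list of visited on-image sites).
    """
    n_rows, n_cols = len(image), len(image[0])
    site, state = start, s0
    total, visited = 0.0, []
    for _ in range(max_steps):
        r, c = site
        inside = 0 <= r < n_rows and 0 <= c < n_cols
        symbol = image[r][c] if inside else EOF
        if inside:
            total += loss(symbol, F(state))   # the prediction precedes the observation
            visited.append(site)
        state = g(state, symbol)              # next state from current state and symbol
        dr, dc = d(state)                     # displacement depends on the new state
        site = (r + dr, c + dc)
    return total, visited

# Toy machine: the state is the last symbol seen, the scan moves one site to the
# right, and the prediction repeats the last symbol (Hamming loss).
image = [[0, 0, 1, 1, 0]]
g = lambda s, x: s if x == EOF else x
d = lambda s: (0, 1)
F = lambda s: s
hamming = lambda x, y: float(x != y)
total, visited = run_finite_state_scandictor(image, (0, 0), 0, g, d, F, hamming, max_steps=5)
print(total, visited)
```

On the toy row the machine errs exactly at the two positions where the symbol changes.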



3.4 A Universal Scandictor for Any Stationary Random Field 

We now show that a universal scandictor which competes successfully with all finite-state machines of the form given in the proof of Lemma 7 does exist and can, in fact, be implemented using the exponential weighting algorithm. In order to show that, we assume that the alphabet $\mathcal{A}$ is finite and the prediction space $D$ is either finite or bounded (such as the $(|\mathcal{A}| - 1)$-simplex of probability measures on $\mathcal{A}$). In the latter case we further assume that $l(x,F)$ is Lipschitz in its second argument for all $x$, i.e., there exists a constant $c$ such that for all $x$, $F$ and $\epsilon$ we have $|l(x,F) - l(x,F + \epsilon)| \le c|\epsilon|$. The following theorem establishes, under the above assumptions, the existence of a universal scandictor for all stationary random fields. 

Theorem 8. Let $X$ be a stationary random field over a finite alphabet $\mathcal{A}$ and a probability measure $Q$. Let the prediction space $D$ be either finite or bounded (with $l(x,F)$ then being Lipschitz in its second argument). Then, there exists a sequence of scandictors $\{(\hat\Psi,\hat F)_n\}$, independent of $Q$, for which 

$$\lim_{n\to\infty} E_{Q_{V_n}} E \frac{1}{|V_n|} L_{(\hat\Psi,\hat F)_n}(X_{V_n}) = U(l,Q) \qquad (26)$$

for any $Q \in \mathcal{M}_S(\Omega)$, where the inner expectation in the l.h.s. of (26) is due to the possible randomization in $(\hat\Psi,\hat F)_n$. 

Proof. Assume first that the range $D$ of the predictors $\{F_t\}$ is finite. Consider the exponential weighting algorithm described in the proof of Theorem 4, where at each $m(n) \times m(n)$ block the algorithm computes the cumulative loss of every possible scandictor for an $m(n) \times m(n)$ block, then chooses the best scandictor (according to the exponential weighting regime described therein) as its output for the next block. By (17), we have 

$$E_{Q_{V_n}} \frac{1}{n^2} \bar L_{alg} \le \min_{(\Psi,F)\in S(V_{m(n)})} E_{Q_{V_{m(n)}}} \frac{1}{m(n)^2} L_{(\Psi,F)}(X^0) + O\!\left(\frac{m(n)}{n}\sqrt{\log\lambda}\right), \qquad (27)$$

where $S(V_{m(n)})$ is the set of all possible scandictors on $m(n) \times m(n)$ blocks and $\lambda$ is the size of that set. Since $\lambda = \lambda(m(n))$, all that is left to check is that the $O\big(\frac{m(n)}{n}\sqrt{\log\lambda}\big)$ expression indeed decays to zero as $n$ tends to infinity. 



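Indeed, the number of possible scanners for a field $B$ over an alphabet $\mathcal{A}$ is 

$$\left|\left\{\{\Psi_t\}_{t=1}^{|B|},\ \Psi_t : \mathcal{A}^{t-1} \to B\right\}\right| = \prod_{k=0}^{|B|-1} (|B| - k)^{|\mathcal{A}|^{k}} \le (|B|!)^{|\mathcal{A}|^{|B|}}, \qquad (28)$$

while the number of possible predictors is 

$$\left|\left\{\{F_t\}_{t=1}^{|B|},\ F_t : \mathcal{A}^{t-1} \to D\right\}\right| = \prod_{k=1}^{|B|} |D|^{|\mathcal{A}|^{k-1}} \le |D|^{|B|\,|\mathcal{A}|^{|B|-1}}. \qquad (29)$$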

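Thus, using the Stirling approximation, $\log k! \approx k\log k$, in the sense that $\lim_{k\to\infty} \frac{\log k!}{k\log k} = 1$, we have 

$$
\begin{aligned}
\frac{m(n)}{n}\sqrt{\log\lambda} &\le \frac{m(n)}{n}\sqrt{|\mathcal{A}|^{m(n)^2}\log\big(m(n)^2!\big) + m(n)^2\,|\mathcal{A}|^{m(n)^2-1}\log|D|} \\
&\approx \frac{m(n)}{n}\sqrt{2\,|\mathcal{A}|^{m(n)^2}\, m(n)^2\log m(n) + m(n)^2\,|\mathcal{A}|^{m(n)^2-1}\log|D|} \\
&= O\!\left(\frac{m(n)^2}{n}\sqrt{|\mathcal{A}|^{m(n)^2}\log m(n)}\right), \qquad (30)
\end{aligned}
$$

which decays to zero as $n \to \infty$ for any $m(n) = o(\sqrt{\log n})$. Namely, for $m(n) = o(\sqrt{\log n})$, equation (27) results in 

$$\liminf_{n\to\infty} E_{Q_{V_n}} \frac{1}{n^2} \bar L_{alg} \le \liminf_{n\to\infty} \min_{(\Psi,F)\in S(V_{m(n)})} E_{Q_{V_{m(n)}}} \frac{1}{m(n)^2} L_{(\Psi,F)}(X^0), \qquad (31)$$

and 

$$\limsup_{n\to\infty} E_{Q_{V_n}} \frac{1}{n^2} \bar L_{alg} \le \limsup_{n\to\infty} \min_{(\Psi,F)\in S(V_{m(n)})} E_{Q_{V_{m(n)}}} \frac{1}{m(n)^2} L_{(\Psi,F)}(X^0). \qquad (32)$$

Since $m(n) \to \infty$ as $n \to \infty$, by [22] the limit $\lim_{n\to\infty} \min_{(\Psi,F)\in S(V_{m(n)})} E_{Q_{V_{m(n)}}} \frac{1}{m(n)^2} L_{(\Psi,F)}(X^0)$ exists and equals the scandictability of the source, $U(l,Q)$. However, by definition, $U(l,Q)$ is the best achievable scandiction performance for the source $Q$, hence, 

$$\liminf_{n\to\infty} E_{Q_{V_n}} \frac{1}{n^2} \bar L_{alg} \ge U(l,Q), \qquad (33)$$

which results in 

$$\lim_{n\to\infty} E_{Q_{V_n}} \frac{1}{n^2} \bar L_{alg} = U(l,Q). \qquad (34)$$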
For the case of infinite (but bounded) range $D$, similarly to [25], we use the fact that the 
loss function $l$ is Lipschitz and take an $\epsilon$-approximation of $D$. We thus have 



+ cm(n)2e(m(n))+o(^^^V'fo^^ (35) 



for some constant c. Choosing e{m{n)) = results in \D\ = ^m{n)^, hence V^og A 



still decays to zero for any $m(n) = o(\sqrt{\log n})$ and (34) is still valid. □ 

Note that the proof of Theorem 8 does not use the well-established theory of universal prediction. Instead, the exponential weighting algorithm is used for all possible scans (within a block) as well as all possible predictors. This is because important parts of the work on prediction in the probabilistic scenario include some assumption on the stationarity of the measure governing the process, such as stationarity or asymptotic mean stationarity [35]. In the scandiction scenario, however, the properties of the output sequence are not easy to determine, and it is possible, in general, that the output sequence is not stationary or ergodic even if the input data array is. Thus, although under certain assumptions one can use a single universal predictor, applied to any scan in a certain set of scans, this is not the case in general. 
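
To get a feeling for why the block size in the proof of Theorem 8 must grow so slowly, the redundancy term can be evaluated numerically. The sketch below assumes that the counting bounds take the form $(|B|!)^{|\mathcal{A}|^{|B|}}$ for scanners and $|D|^{|B||\mathcal{A}|^{|B|-1}}$ for predictors, with $|B| = m^2$; the function names and the default binary alphabet and prediction space are hypothetical choices for illustration, and constants are ignored.

```python
import math

def log_num_scandictors_bound(m, alphabet_size, num_predictions):
    """Upper bound on log(lambda) for an m x m block: log of the scanner bound
    (|B|!)^(|A|^|B|) plus log of the predictor bound |D|^(|B| * |A|^(|B|-1))."""
    b = m * m
    log_scanners = alphabet_size ** b * math.lgamma(b + 1)   # lgamma(b + 1) = log(b!)
    log_predictors = b * alphabet_size ** (b - 1) * math.log(num_predictions)
    return log_scanners + log_predictors

def redundancy_term(n, m, alphabet_size=2, num_predictions=2):
    """The (m / n) * sqrt(log lambda) term, up to constant factors."""
    bound = log_num_scandictors_bound(m, alphabet_size, num_predictions)
    return (m / n) * math.sqrt(bound)

# Image side n = 10^6: the term is tiny for m = 2 but explodes quickly with m.
for m in (2, 4, 6):
    print(m, redundancy_term(10**6, m))
```

The doubly exponential growth of the scandictor set in $m$ is what forces a choice such as $m(n) = o(\sqrt{\log n})$.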

3.5 Universal Scandiction for Mixing Random Fields 

The proof of Theorem 4 established the universality of $(\hat\Psi,\hat F)_n$ under the expected cumulative loss criterion. In order to establish its universality in the $Q$-a.s. sense, we examine the conditions on the measure $Q$ such that the following equality holds. 

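$$\lim_{K\to\infty} \frac{1}{K^2} \sum_{i=1}^{K^2} L_j(X^i) = E_{Q_{V_m}} L_j(X^0) \quad Q\text{-a.s.} \qquad (36)$$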

To this end, we briefly review the conditions for the individual ergodic theorem for general dynamical systems given in [37], specialized for $\mathbb{Z}^2$. Let $\{A_n\}$ be a sequence of subsets of $\mathbb{Z}^2$. For each $n$, the set $A_n$ is the set of sites over which the arithmetical average is taken. Let $A \triangle B$ denote the symmetric difference between the sets $A$ and $B$, $(A \cup B) \setminus (A \cap B)$, and let $A + i = \{j + i : j \in A\}$ denote the set $A$ shifted by $i$. 

Condition 1 ([37, E1']). For all $i \in \mathbb{Z}^2$, 

$$\lim_{n\to\infty} \frac{|A_n \triangle (A_n + i)|}{|A_n|} = 0. \qquad (37)$$

Condition 2 ([37, E3'']). There exists a constant $C_1 < \infty$ such that for all $n$, 

$$|\{k : k = i - j,\ i,j \in A_n\}| \le C_1 |A_n|. \qquad (38)$$

Condition 3 ([37, E4]). There exists a sequence of measurable sets $\{M_n\}$ such that 

$$\liminf_{n\to\infty} \frac{|\{k : k = i + j,\ i \in A_n,\ j \in M_n\}|}{|M_n|} = C_2 < \infty. \qquad (39)$$

By [37, Theorem 6.1'], if the sequence $\{A_n\}$ satisfies conditions 1-3, then, for any stationary random field $X$ with $E|X_0| < \infty$, we have 

$$\lim_{n\to\infty} \frac{1}{|A_n|} \sum_{i\in A_n} X_i = E\{X_0 | \mathcal{I}\} \quad Q\text{-a.s.}, \qquad (40)$$


^An important exception is the Kalman filter Section 7.7]. 

where $Q$ is the measure governing $X$ and $\mathcal{I}$ is the $\sigma$-algebra of invariant sets of $\Omega$, that is, 

$$A \in \mathcal{I} \ \text{iff} \ T_i(A) = A \ \text{for all} \ i \in \mathbb{Z}^2. \qquad (41)$$

If $Q$ is ergodic, namely, for each $A \in \mathcal{I}$, $Q(A) \in \{0,1\}$, then $E\{X_0|\mathcal{I}\}$ is deterministic and equals $EX_0$. 

Clearly, since $L_j(x^i)$ depends on a set of sites, with the average in (36) taken over the sets $A_n = \{i : i = m \cdot j,\ j \in V_K\}$, (36) may not hold, even if $Q$ is ergodic, as, for example, Condition 1 is not satisfied. These two obstacles can be removed by defining an alternative random field, $\tilde X$, over the set of sites $m \cdot \mathbb{Z}^2 = \{j : j = m \cdot i,\ i \in \mathbb{Z}^2\}$, where $\tilde X_i$ equals $L_j(X^i)$ and $X^i$ is the corresponding $m \times m$ block of $X$. Note that since the loss function $l(\cdot)$ is bounded and $m$ is finite, $E|\tilde X_0| < \infty$. It is not hard to see that conditions 1-3 are now satisfied (with the new space being $m \cdot \mathbb{Z}^2$). However, for $E\{\tilde X_0|\mathcal{I}_m\}$ to be deterministic, where $\mathcal{I}_m$ is the $\sigma$-algebra of $m$-invariant sets, 

$$A \in \mathcal{I}_m \ \text{iff} \ T_j(A) = A \ \text{for all} \ j = i \cdot m,\ i \in \mathbb{Z}^2, \qquad (42)$$

it is required that $\mathcal{I}_m$ is the trivial $\sigma$-algebra. In other words, block ergodicity of $Q$ is required. 

We now show that if the measure $Q$ is strongly mixing, then it is block-ergodic for any finite block size. For $A, B \subset \mathbb{Z}^2$, define 

$$\alpha^Q(A,B) = \sup\{|Q(U \cap V) - Q(U)Q(V)| : U \in \sigma(X_A),\ V \in \sigma(X_B)\}, \qquad (43)$$

where $\sigma(X_B)$ is the smallest sigma-algebra generated by $X_B$. Let $\alpha^Q_{a,b}(k)$ denote the strong mixing coefficient [38, Sec. 1.7] of the random field $Q$, 

$$\alpha^Q_{a,b}(k) = \sup\{\alpha^Q(A,B) : |A| \le a,\ |B| \le b,\ d(A,B) \ge k\}, \qquad (44)$$

where $d$ is a metric on $\mathbb{Z}^2$ and $d(A,B)$ is the distance between the closest points, i.e., $d(A,B) = \min_{i\in A, j\in B} d(i,j)$. Assume now that $Q$ is strongly mixing in the sense that for all $a,b \in \mathbb{N} \cup \{\infty\}$, $\alpha^Q_{a,b}(k) \to 0$ as $k \to \infty$. It is easy to see that $Q(A) \in \{0,1\}$ for all $A \in \mathcal{I}_m$. Indeed, 

$$\lim_{d(i,0)\to\infty} |Q(T_{i\cdot m}(A) \cap A) - Q(T_{i\cdot m}(A))Q(A)| = 0, \qquad (45)$$

however, since $A$ is $m$-invariant, $T_{i\cdot m}(A) = A$ and thus $Q(A) = Q(A)^2$. Hence $Q$ is $m$-block ergodic for each $m$ (i.e., totally ergodic). 

The following theorem asserts that under the assumption that the random field $Q$ is strongly mixing, the results of Theorem 4 apply in the a.s. sense as well. 



^In fact, Tempelman's work [37] also includes slightly weaker conditions, but neither is satisfied in the current setting. 

Theorem 9. Let $X$ be a stationary strongly mixing random field with a probability measure $Q$. Let $\mathcal{F} = \{\mathcal{F}_n\}$ be a sequence of finite sets of scandictors and assume that $U_{\mathcal{F}}(l,Q)$ exists. Then, if the universal algorithm suggested in the proof of Theorem 4 uses a fixed block size $m$, we have 

$$\liminf_{n\to\infty} \frac{1}{|V_n|} L_{alg}(X_{V_n}) \le U_{\mathcal{F}}(l,Q) + \delta(m) \quad Q\text{-a.s.} \qquad (46)$$

for any such $Q$ and some $\delta(m)$ such that $\delta(m) \to 0$ as $m \to \infty$. 

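Proof. For each $x_{V_n}$, we have 

$$\frac{1}{|V_n|} L_{\min}(x_{V_n}) = \frac{1}{|V_n|} \min_j \sum_{i=0}^{(K+1)^2-1} L_j(x^i) \le \frac{1}{|V_n|} \min_j \left( \sum_{i=1}^{K^2} L_j(x^i) + (n^2 - K^2m^2)\, l_{\max} \right), \qquad (47)$$

where the sum on the right-hand side runs over the $K^2$ full $m \times m$ blocks. 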

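By Proposition 5, 

$$\frac{1}{|V_n|} \bar L_{alg}(x_{V_n}) \le \frac{1}{|V_n|} L_{\min}(x_{V_n}) + \frac{m(n + m)}{|V_n|}\sqrt{\log\lambda}\,\frac{l_{\max}}{\sqrt{2}}. \qquad (48)$$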



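Thus, 

$$
\begin{aligned}
\liminf_{n\to\infty} \frac{1}{|V_n|} \bar L_{alg}(x_{V_n}) &\le \liminf_{n\to\infty} \frac{1}{|V_n|} L_{\min}(x_{V_n}) \\
&\le \frac{1}{m^2} \liminf_{K\to\infty} \min_j \frac{1}{K^2} \sum_{i=1}^{K^2} L_j(x^i) \\
&\le \frac{1}{m^2} \min_j \liminf_{K\to\infty} \frac{1}{K^2} \sum_{i=1}^{K^2} L_j(x^i). \qquad (49)
\end{aligned}
$$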



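Since $K \to \infty$ as $n \to \infty$, by the block ergodicity of $Q$ and the fact that for finite $m$ and each $(\Psi,F)_j \in \mathcal{F}_m$, $L_j(X)$ is a bounded function, it follows that 

$$\liminf_{K\to\infty} \frac{1}{K^2} \sum_{i=1}^{K^2} L_j(x^i) = E_{Q_{V_m}} L_j(X^0) \quad Q\text{-a.s.} \qquad (50)$$

Finally, since $U_{\mathcal{F}}(l,Q)$ exists, there exists $\delta(m)$ such that $\delta(m) \to 0$ as $m \to \infty$ and we have 

$$\liminf_{n\to\infty} \frac{1}{|V_n|} \bar L_{alg}(x_{V_n}) \le U_{\mathcal{F}}(l,Q) + \delta(m) \quad Q\text{-a.s.} \qquad (51)$$

The fact that $L_{alg}(x_{V_n})$ converges to $\bar L_{alg}(x_{V_n})$ a.s. is clear from the proof of Proposition 6. □ 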
Very similar to Theorem 9, we also have the following corollary. 

Corollary 10. Let $X$ be a stationary strongly mixing random field over a finite alphabet $\mathcal{A}$ and a probability measure $Q$. Let the prediction space $D$ be either finite or bounded (with $l(x,F)$ then being Lipschitz in its second argument). Then, there exists a sequence of scandictors $\{(\hat\Psi,\hat F)_n\}$, independent of $Q$, for which 

$$\liminf_{n\to\infty} \frac{1}{|V_n|} L_{alg}(X_{V_n}) \le U(l,Q) + \delta(m) \quad Q\text{-a.s.} \qquad (52)$$

for any such $Q$ and some $\delta(m)$ such that $\delta(m) \to 0$ as $m \to \infty$. Thus, when $m \to \infty$, the performance of $\{(\hat\Psi,\hat F)_n\}$ equals the scandictability of the source, $Q$-a.s. 

3.6 Universal Scandiction for Individual Images 

The proofs of Theorems 4 and 9 relied on the stationarity, or the stationarity and mixing property, of the random field $X$ (respectively). In the proof of Theorem 4, we used the fact that the cumulative loss of any scandictor $(\Psi,F)$ on a given block of data has the same expected value as that on any other block. In the proof of Theorem 9, on the other hand, the fact that the Cesàro mean of the losses on finite blocks converges to a single value, the expected cumulative loss, was used. 

When $x$ is an individual image, however, the cumulative loss of the suggested algorithm may be higher than that of the best scandictor in the scandictor set, since restarting a scandictor at the beginning of each block may result in an arbitrarily larger loss compared to the cumulative loss when the scandictor scans the entire data. In contrast to the prediction problem, in the scandiction scenario, if the scanner is arbitrary, then different starting conditions may yield different scans (i.e., a different reordering of the data) and thus an arbitrarily different cumulative loss, even if the predictor attached to it is very simple, e.g., a Markov predictor. It is expected, however, that when the scandictors have some structure, it will be possible to compete with finite sets of scandictors in the individual image scenario. 

In this subsection, we suggest a basic scenario under which universal scandiction of individual images is possible. Further research in this area is required, though, in order to identify larger sets of scandictors under which universality is achievable. As mentioned earlier, since the exponential weighting algorithm used in the proofs of Theorems 4 and 9 applied only block-wise scandictors, i.e., scandictors which scan every block of the data separately from all other blocks, stationarity or stationarity and ergodicity of the data were required in order to prove its convergence. Here, since the data is an individual image, we impose restrictions on the families of scandictors in order to achieve meaningful results (this reasoning is analogous to that described in [23, Section I-B] for the prediction problem). The first restriction is that the scanners with which we compete are such that the actual path taken by each scanner when it is applied in a block-wise order has some kind of an overlap (in a sense which will be defined later) with the path taken when it is applied to the whole image. The second restriction is that the predictors are Markovian of finite order (i.e., the prediction depends only on the last $k$ symbols seen, for some finite $k$). Note that the first restriction does not restrict us to compete only with scandictors which operate in a block-wise order. It only requires that the excess loss induced when the scandictors operate in a block-wise order, compared to operating on the entire image, is not too large, provided that the predictor is Markovian. 

The following definition, and the results which follow, make the above requirements precise. For two scanners $\Psi$ and $\Psi'$ for the data array $x_B$, define $N_{B,K}(x_B,\Psi,\Psi')$ as the number of sites in $B$ such that their immediate past (context of unit length) under $\Psi$ is contained in the context of length $K$ under $\Psi'$, namely, 

$$N_{B,K}(x_B,\Psi,\Psi') = \left|\left\{1 \le i \le |B| : \exists_{1 \le j \le |B|,\, k \le K}\ (\Psi_i,\Psi_{i-1}) = (\Psi'_j,\Psi'_{j-k})\right\}\right|. \qquad (53)$$

Note that in the above definition, a "context" of size $w$ for a site in $B$ refers to the set of $w$ sites which precede it in the discussed scan, and not their actual values. When $\{\Psi_n\}$ is a sequence of scanners, where $\Psi_n$ is a scanner for $V_n$, it will be interesting to consider the number of sites in $B \subset V_{n_2}$, where $B$ is an $n_1 \times n_1$ rectangle, $n_1 < n_2$, such that their immediate past under $\Psi_{n_2}$ (applied to $V_{n_2}$) is contained in the context of length $K$ under $\Psi_{n_1}$ (applied to $B$), that is, 

$$N_{B,K}(x_B,\Psi_{n_2},\Psi_{n_1}) = \left|\left\{1 \le i \le |B| : \exists_{1 \le j \le |V_{n_2}|,\, k \le K}\ (\Psi_{n_2,i},\Psi_{n_2,i-1}) = (\Psi_{n_1,j},\Psi_{n_1,j-k})\right\}\right|, \qquad (54)$$

where $\Psi_{n,i}$ is the $i$'th site the scanner $\Psi_n$ visits. The following proposition is proved in Appendix A.3. 

Proposition 11. Consider two scanners $\Psi$ and $\Psi'$ for $B$ such that for any individual image $x_B$ we have 

$$N_{B,K}(x_B,\Psi,\Psi') = |B| - o(|B|). \qquad (55)$$

Then, for any $x_B$, 

L(^^,^pKw,opt){xB) < L(^^^p^,opt){xB) + 0{\B\){K + l)'^~^lmax, (56) 

where for each scandictor $(\Psi,F^{w,opt})$, $F^{w,opt}$ denotes the optimal $w$-order Markov predictor for the scan $\Psi$. 

Note that in order to satisfy the condition in (55) for any array $x_B$, it is likely (but not compulsory) that both $\Psi$ and $\Psi'$ are data-independent scans. However, they need not be identical. If, for example, $\Psi$ is a raster scan from left to right, and $\Psi'$ applies the same left-to-right scan, but with a different ordering of the rows, then the condition is satisfied for any $x_B$. 
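
The raster-scan example can be checked directly for data-independent scans, where the count does not depend on the data. The sketch below is an illustration under that assumption; the function names `context_overlap_count` and `raster` are hypothetical.

```python
def context_overlap_count(scan_a, scan_b, K):
    """Number of sites whose immediate predecessor under scan_a lies among the
    K sites that precede the same site under scan_b.

    Both scans are lists of sites: data-independent visiting orders of the same set.
    """
    pos_b = {site: idx for idx, site in enumerate(scan_b)}
    count = 0
    for i in range(1, len(scan_a)):
        site, prev = scan_a[i], scan_a[i - 1]
        # prev must appear k steps before site in scan_b, for some 1 <= k <= K
        if 1 <= pos_b[site] - pos_b[prev] <= K:
            count += 1
    return count

def raster(rows, n_cols):
    """Left-to-right raster scan that visits the given rows in the given order."""
    return [(r, c) for r in rows for c in range(n_cols)]

n = 8
scan_a = raster(range(n), n)               # rows visited top to bottom
scan_b = raster(reversed(range(n)), n)     # same row scan, rows visited bottom to top
print(context_overlap_count(scan_a, scan_b, K=1), "of", n * n)
```

Only the first site of each row fails the test, so the count is $n^2 - n$, that is, all but a vanishing fraction of the $|B| = n^2$ sites.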

The result of Proposition 11 yields the following corollary, which gives sufficient conditions on the scandictor sets under which a universal scandictor for any individual image exists. The proof can be found in Appendix A.4. 

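Corollary 12. Let $\{\mathcal{F}_n\}$, $|\mathcal{F}_n| = \lambda < \infty$, be a sequence of scandictor sets, where $\mathcal{F}_n = \{(\Psi_n^1,F^1),(\Psi_n^2,F^2),\ldots,(\Psi_n^\lambda,F^\lambda)\}$ is a set of scandictors for $V_n$. Assume that the predictors are Markov of finite order $w$, the prediction space $D$ is finite, and that there exists $m(n) = o(n)$ (yet $m(n) \to \infty$ as $n \to \infty$) such that for all $1 \le i \le \lambda$, $n$, and $x_{V_n}$ we have 

$$N_{B_{m(n)},K}\big(x_{B_{m(n)}},\Psi_n^i,\Psi_{m(n)}^i\big) = m(n)^2 - o\big(m(n)^2\big), \qquad (57)$$

where $B_{m(n)}$ is any one of the sub-blocks of size $m(n) \times m(n)$ of $V_n$. Then, there exists a sequence of scandictors $\{(\hat\Psi,\hat F)_n\}$ such that for any image $x$ 

$$\liminf_{n\to\infty} E \frac{1}{|V_n|} L_{(\hat\Psi,\hat F)_n}(x_{V_n}) \le \liminf_{n\to\infty} \min_{(\Psi,F)\in\mathcal{F}_n} \frac{1}{|V_n|} L_{(\Psi,F)}(x_{V_n}), \qquad (58)$$

where the expectation in the l.h.s. of (58) is due to the possible randomization in $(\hat\Psi,\hat F)_n$. 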



Although the condition in (57) is limiting, and may not be met by many data-dependent scans, Corollary 12 still answers in the affirmative the following basic question: do there exist scandictor sets for which one can find a universal scandictor in the individual image scenario? For example, by Corollary 12, if the scandictor set includes all raster-type scans (e.g., left-to-right, right-to-left, up-down, down-up, diagonal, etc.), accompanied by Markov predictors of finite order, then there exists a universal scandictor whose asymptotic normalized cumulative loss is less than or equal to that of the best scandictor in the set, for any individual image $x$. The condition in (57) is also satisfied for some well-known "self-similar" space-filling curves, such as the Sierpinski or Lebesgue curves [39]. 



4 Bounds on the Excess Scandiction Loss for Non-Optimal Scanners 

While the results of Section 3 establish the existence of a universal scandictor for all stationary random fields and bounded loss functions (under the terms of Theorem 8), it is interesting to investigate, for both practical and theoretical reasons, what is the excess scandiction loss when non-optimal scanners are used. That is, in this section we answer the following question: Suppose that, for practical reasons for example, one uses a non-optimal scanner, accompanied by the optimal predictor for that scan. How large is the excess loss incurred by this scheme with respect to optimal scandiction? 

For the sake of simplicity, we consider the scenario of predicting the next outcome of a binary source, with $D = [0,1]$ as the prediction space. Hence, $l : \{0,1\} \times [0,1] \to \mathbb{R}$ is the loss function. Furthermore, we assume a deterministic scanner (though data-dependent, of course). The generalization to randomized scanners is cumbersome but straightforward. 

Let $\phi_l$ denote the Bayes envelope associated with $l$, i.e., 

$$\phi_l(p) = \min_{q\in[0,1]} \big[(1-p)\,l(0,q) + p\,l(1,q)\big]. \qquad (59)$$

We further define 

$$\epsilon_l = \min_{\alpha,\beta} \max_{0\le p\le 1} |\alpha h_b(p) + \beta - \phi_l(p)|, \qquad (60)$$

where $h_b(p)$ is the binary entropy function. Thus $\epsilon_l$ is the error in approximating $\phi_l(p)$ by the best affine function of $h_b(p)$. For example, when $l$ is the Hamming loss function, denoted by $l_H$, we have $\epsilon_{l_H} = 0.08$ and when $l$ is the squared error, denoted by $l_s$, $\epsilon_{l_s} = 0.0137$. For the log loss, however, the expected instantaneous loss equals the conditional entropy, hence the expected cumulative loss coincides with the entropy, which is invariant to the scan, and we have $\epsilon_l = 0$. To wit, the scan is inconsequential under log loss. 
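
The quoted values of the approximation error can be reproduced numerically. The following is a rough grid-search sketch, not the authors' computation: it assumes entropy measured in bits, uses the closed-form Bayes envelopes $\min(p, 1-p)$ for Hamming loss and $p(1-p)$ for squared error, and restricts the slope $\alpha$ to $[0,1]$; all function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Binary entropy in bits, with h_b(0) = h_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def approx_error(bayes_envelope, alpha_steps=200, p_steps=1000):
    """Grid-search estimate of min over (alpha, beta) of
    max over p of |alpha * h_b(p) + beta - phi(p)|.

    For a fixed alpha the best beta centers the error, so the inner optimum
    is half the range of alpha * h_b(p) - phi(p) over p.
    """
    ps = [k / p_steps for k in range(p_steps + 1)]
    hs = [binary_entropy(p) for p in ps]
    phis = [bayes_envelope(p) for p in ps]
    best = float("inf")
    for a in range(alpha_steps + 1):
        alpha = a / alpha_steps              # alpha is searched over [0, 1]
        gaps = [alpha * h - phi for h, phi in zip(hs, phis)]
        best = min(best, (max(gaps) - min(gaps)) / 2)
    return best

def hamming_envelope(p):
    return min(p, 1 - p)                     # Bayes envelope of the Hamming loss

def squared_envelope(p):
    return p * (1 - p)                       # Bayes envelope of the squared error

eps_hamming = approx_error(hamming_envelope)
eps_squared = approx_error(squared_envelope)
print(round(eps_hamming, 4), round(eps_squared, 4))
```

Under these assumptions the search returns roughly 0.0805 for the Hamming loss and roughly 0.0136 for the squared error, in line with the quoted values up to rounding.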

Although the definitions of $\phi_l(p)$ and $\epsilon_l$ refer to the binary scenario, the results below (Theorem 13 and Propositions 14 and 15) hold for larger alphabets, with $\epsilon_l$ defined as in (60), with the maximum ranging over the simplex of all distributions on the alphabet, and $h(p)$ (replacing $h_b(p)$) and $\phi_l(p)$ denoting the entropy and Bayes envelope of the distribution $p$, respectively. 

Let $\Psi$ be any (possibly data-dependent) scan, and let $E_{Q_B}\frac{1}{|B|}L_{(\Psi,F^{opt})}(X_B)$ denote the expected normalized cumulative loss in scandicting $X_B$ with the scan $\Psi$ and the optimal predictor for that scan, under the loss function $l$. Remembering that $U(l,Q_B)$ denotes the scandictability of $X_B$ w.r.t. the loss function $l$, namely, $U(l,Q_B) = \inf_{\Psi} E_{Q_B}\frac{1}{|B|}L_{(\Psi,F^{opt})}(X_B)$, our main result in this section is the following. 



Theorem 13. Let $X_B$ be an arbitrarily distributed binary field. Then, for any scan $\Psi$, 

$$E_{Q_B} \frac{1}{|B|} L_{(\Psi,F^{opt})}(X_B) - U(l,Q_B) \le 2\epsilon_l. \qquad (61)$$

That is, the excess loss incurred by applying any scanner $\Psi$, accompanied by the optimal predictor for that scan, with respect to optimal scandiction is not larger than $2\epsilon_l$. 
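
A small numerical experiment illustrates the bound. The sketch below draws a random distribution on a $2 \times 2$ binary field and computes, under Hamming loss, the expected normalized loss of the optimal predictor for each of the 24 data-independent scan orders. Since every such loss is at least the scandictability, the spread between the worst and the best order cannot exceed $2\epsilon_{l_H}$. The restriction to data-independent scans, the random distribution, and the numeric value 0.0805 used for $\epsilon_{l_H}$ are assumptions of this illustration.

```python
import itertools
import random

def expected_hamming_loss(pmf, order):
    """Expected cumulative Hamming loss of the optimal predictor when the sites
    are visited in the fixed (data-independent) order `order`.

    pmf maps each binary tuple (one entry per site) to its probability.
    """
    total = 0.0
    for t in range(len(order)):
        seen, nxt = order[:t], order[t]
        joint = {}  # joint[(past values, bit)] = P(past values, next value = bit)
        for x, p in pmf.items():
            key = (tuple(x[s] for s in seen), x[nxt])
            joint[key] = joint.get(key, 0.0) + p
        pasts = {past for past, _ in joint}
        # Given the past, the optimal predictor errs with the probability of
        # the less likely value of the next symbol.
        total += sum(min(joint.get((past, 0), 0.0), joint.get((past, 1), 0.0))
                     for past in pasts)
    return total

rng = random.Random(0)
sites = 4                                    # a 2 x 2 binary field, flattened
outcomes = list(itertools.product((0, 1), repeat=sites))
weights = [rng.random() for _ in outcomes]
norm = sum(weights)
pmf = {x: w / norm for x, w in zip(outcomes, weights)}

losses = [expected_hamming_loss(pmf, order) / sites
          for order in itertools.permutations(range(sites))]
spread = max(losses) - min(losses)
print(spread, spread <= 2 * 0.0805)
```

The check compares fixed scan orders only; data-dependent scans, which the theorem also covers, are not enumerated here.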

To prove Theorem 13, we first introduce a prediction result (i.e., with no data reordering) on the error in estimating the cumulative loss of a predictor under a loss function $l$ with the best affine function of the entropy. We then generalize this result to the multi-dimensional case. 

Proposition 14. Let be an arbitrarily distributed binary n-tuple and let EL°^^{X^) denote 
the expected cumulative loss in predicting X" with the optimal distribution- dependent scheme 
for the loss function I. Then, 

ai-H{X^)+f3i--EL°^\X^) < 



n ' ' n ' 

where ai and Pi are the achievers of the minimum in (|6U|) . 

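Proof. Let $\alpha_l$ and $\beta_l$ be the achievers of the minimum in (60). We have,

$$
\begin{aligned}
\left| \alpha_l\frac{1}{n}H(X^n) + \beta_l - \frac{1}{n}EL^{opt}(X^n) \right|
&= \left| \frac{1}{n}\sum_{t=1}^{n}\sum_{x^{t-1}} P(x^{t-1}) \sum_{x_t} \Big[ -\alpha_l P(x_t|x^{t-1})\log P(x_t|x^{t-1}) + P(x_t|x^{t-1})\beta_l - P(x_t|x^{t-1})\, l\big(x_t, F_t^{opt}(x^{t-1})\big) \Big] \right| \\
&\overset{(a)}{=} \left| \frac{1}{n}\sum_{t=1}^{n}\sum_{x^{t-1}} P(x^{t-1}) \Big[ \alpha_l h_b\big(P(\cdot|x^{t-1})\big) + \beta_l - \phi_l\big(P(\cdot|x^{t-1})\big) \Big] \right| \\
&\le \frac{1}{n}\sum_{t=1}^{n}\sum_{x^{t-1}} P(x^{t-1}) \left| \alpha_l h_b\big(P(\cdot|x^{t-1})\big) + \beta_l - \phi_l\big(P(\cdot|x^{t-1})\big) \right| \\
&\le \frac{1}{n}\sum_{t=1}^{n}\sum_{x^{t-1}} P(x^{t-1}) \max_p \left| \alpha_l h_b(p) + \beta_l - \phi_l(p) \right| \\
&= \max_p \left| \alpha_l h_b(p) + \beta_l - \phi_l(p) \right| = \epsilon_l, \qquad (63)
\end{aligned}
$$

where (a) is by the definition of $\phi_l(\cdot)$ and the optimality of $F^{opt}$ with respect to $l$.

□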
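The statement of Proposition 14 can be checked numerically on small examples. The sketch below does so for Hamming loss, computing $H(X^n)$ through the chain rule and the optimal expected loss through the Bayes envelope $\min(p, 1-p)$ of each conditional distribution. The constants `ALPHA`, `BETA` and `EPS` are assumed Hamming-loss values ($\alpha_l = 1/2$, $\beta_l = -\epsilon_l$, $\epsilon_l \approx 0.0805$) obtained by equalizing the deviation of $\frac{1}{2}h_b(p) - \min(p,1-p)$ over $p$; they are not quoted from the paper, and all names are illustrative.

```python
import itertools
import math
import random

def hb(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

# Assumed constants for Hamming loss (see the lead-in).
ALPHA = 0.5
EPS = (0.5 * hb(0.2) - 0.2) / 2.0
BETA = -EPS

def prop14_gap(pmf, n):
    """|alpha*H(X^n)/n + beta - E L_opt(X^n)/n| for a pmf over {0,1}^n."""
    prefix = {}  # probability of every prefix x^t, t = 0..n
    for x, q in pmf.items():
        for t in range(n + 1):
            prefix[x[:t]] = prefix.get(x[:t], 0.0) + q
    entropy, loss = 0.0, 0.0
    for ctx, q in prefix.items():
        if len(ctx) == n or q <= 0.0:
            continue
        p1 = min(1.0, max(0.0, prefix.get(ctx + (1,), 0.0) / q))
        entropy += q * hb(p1)            # chain rule for H(X^n)
        loss += q * min(p1, 1.0 - p1)    # Bayes envelope under Hamming loss
    return abs(ALPHA * entropy / n + BETA - loss / n)

def random_pmf(n, rng):
    """An arbitrary (hypothetical) distribution over binary n-tuples."""
    outcomes = list(itertools.product((0, 1), repeat=n))
    w = [rng.random() for _ in outcomes]
    total = sum(w)
    return {x: wi / total for x, wi in zip(outcomes, w)}

rng = random.Random(0)
gaps = [prop14_gap(random_pmf(4, rng), 4) for _ in range(20)]
fair = {x: 1.0 / 16 for x in itertools.product((0, 1), repeat=4)}
fair_gap = prop14_gap(fair, 4)
print(max(gaps) <= EPS, fair_gap)
```

The i.i.d. fair-coin source is included because $p = 1/2$ is one of the points where the deviation is largest, so for it the gap equals $\epsilon_l$.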



The following proposition is the generalization of Proposition 14 to the multi-dimensional case.



Proposition 15. Let $X_B$ be an arbitrarily distributed binary random field. Then, for any scan $\Psi$,

$$\left| \alpha_l\frac{1}{|B|}H(X_B) + \beta_l - E_{Q_B}\frac{1}{|B|}L_{(\Psi,F^{opt})}(X_B) \right| \le \epsilon_l, \qquad (64)$$

where $\alpha_l$ and $\beta_l$ are the achievers of the minimum in (60).



For data-independent scans, the proof follows the proof of Proposition 14 verbatim by applying
it to the reordered $|B|$-tuple $X_{\Psi_1}, \ldots, X_{\Psi_{|B|}}$ and remembering that
$H(X_B) = H(X_{\Psi_1}, \ldots, X_{\Psi_{|B|}})$.
For data-dependent scans, the proof is similar, but requires more caution.

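Proof of Proposition 15. Let $\alpha_l$ and $\beta_l$ be the achievers of the minimum in (60). For a given
data array $x_B$, the sites $\Psi_1, \Psi_2(x_{\Psi_1}), \ldots, \Psi_{|B|}(x^{\Psi_{|B|-1}})$ are fixed, and merely reflect a
reordering of $x_B$ as a $|B|$-tuple. Thus,

$$
\begin{aligned}
\left| \alpha_l\frac{1}{|B|}H(X_B) + \beta_l - E_{Q_B}\frac{1}{|B|}L_{(\Psi,F^{opt})}(X_B) \right|
&= \left| \frac{1}{|B|}\sum_{x_B}\left( -\alpha_l P(x_B)\log P(x_B) - P(x_B)\sum_{t=1}^{|B|} l\big(x_{\Psi_t}, F_t^{opt}(x^{\Psi_{t-1}})\big) \right) + \beta_l \right| \\
&= \left| \frac{1}{|B|}\sum_{x_B}\left( -\alpha_l P(x_B)\sum_{t=1}^{|B|}\log P(x_{\Psi_t}|x^{\Psi_{t-1}}) - P(x_B)\sum_{t=1}^{|B|} l\big(x_{\Psi_t}, F_t^{opt}(x^{\Psi_{t-1}})\big) \right) + \beta_l \right| \\
&= \left| \frac{1}{|B|}\sum_{t=1}^{|B|}\sum_{x_B} P(x_B)\Big( -\alpha_l\log P(x_{\Psi_t}|x^{\Psi_{t-1}}) + \beta_l - l\big(x_{\Psi_t}, F_t^{opt}(x^{\Psi_{t-1}})\big) \Big) \right|. \qquad (65)
\end{aligned}
$$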

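Fix $t = t_0$ in the sum over $t$. Consider all data arrays $x_B$ such that for a specific scanner $\Psi$
we have

$$\big( \Psi_1, \Psi_2(x_{\Psi_1}), \ldots, \Psi_{t_0-1}(x_{\Psi_1}, \ldots, x_{\Psi_{t_0-2}}) \big) = I, \qquad (66)$$

where $I \subset B$ is a fixed set of sites, and $(x_{\Psi_1}, \ldots, x_{\Psi_{t_0-1}}) = a$, for some $a \in \{0,1\}^{t_0-1}$. In
this case, $\Psi_{t_0}(x^{\Psi_{t_0-1}})$ is also fixed, and since the term in the parentheses of (65) depends only
on $I$, $a$ and $x_{\Psi_{t_0}}$, we have

$$
\begin{aligned}
&\sum_{x_B} P(x_B)\Big( -\alpha_l\log P(x_{\Psi_{t_0}}|x^{\Psi_{t_0-1}}) + \beta_l - l\big(x_{\Psi_{t_0}}, F_{t_0}^{opt}(x^{\Psi_{t_0-1}})\big) \Big) \\
&\qquad = \sum_{I,a} P(x_I = a)\sum_{x_{\Psi_{t_0}}\in\{0,1\}} P(x_{\Psi_{t_0}}|x_I = a)\Big( -\alpha_l\log P(x_{\Psi_{t_0}}|x_I = a) + \beta_l - l\big(x_{\Psi_{t_0}}, F_{t_0}^{opt}(a)\big) \Big). \qquad (67)
\end{aligned}
$$

Consequently,

$$
\begin{aligned}
\left| \alpha_l\frac{1}{|B|}H(X_B) + \beta_l - E_{Q_B}\frac{1}{|B|}L_{(\Psi,F^{opt})}(X_B) \right|
&= \left| \frac{1}{|B|}\sum_{t=1}^{|B|}\sum_{I,a} P(x_I = a)\sum_{x_{\Psi_t}\in\{0,1\}} P(x_{\Psi_t}|x_I = a)\Big( -\alpha_l\log P(x_{\Psi_t}|x_I = a) + \beta_l - l\big(x_{\Psi_t}, F_t^{opt}(a)\big) \Big) \right| \\
&\le \frac{1}{|B|}\sum_{t=1}^{|B|}\sum_{I,a} P(x_I = a)\left| \alpha_l h_b\big(P(\cdot|x_I = a)\big) + \beta_l - \phi_l\big(P(\cdot|x_I = a)\big) \right| \\
&\le \frac{1}{|B|}\sum_{t=1}^{|B|}\sum_{I,a} P(x_I = a)\max_p\left| \alpha_l h_b(p) + \beta_l - \phi_l(p) \right| \\
&= \max_p\left| \alpha_l h_b(p) + \beta_l - \phi_l(p) \right| = \epsilon_l. \qquad (68)
\end{aligned}
$$

□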



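It is now easy to see why Theorem 13 holds.

Proof of Theorem 13. The proof is a direct application of Proposition 15, as for any scan $\Psi$,

$$
\begin{aligned}
\left| E_{Q_B}\frac{1}{|B|}L_{(\Psi,F^{opt})}(X_B) - U(l,Q_B) \right|
&\le \left| \alpha_l\frac{1}{|B|}H(X_B) + \beta_l - E_{Q_B}\frac{1}{|B|}L_{(\Psi,F^{opt})}(X_B) \right| + \left| \alpha_l\frac{1}{|B|}H(X_B) + \beta_l - U(l,Q_B) \right| \\
&\le 2\epsilon_l. \qquad (69)
\end{aligned}
$$

□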
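As an illustration of Theorem 13, the sketch below draws an arbitrary distribution for a four-site binary field and evaluates the normalized expected Hamming loss of the optimal predictor under every data-independent scan (every permutation of the sites). Data-dependent scans are not enumerated, so the smallest value found only upper-bounds $U(l,Q_B)$; nevertheless, by Proposition 15 every scan lies within $\epsilon_l$ of the same affine function of the entropy, so the spread over scans cannot exceed $2\epsilon_l$. The value of `EPS` is the assumed Hamming-loss constant ($\approx 0.0805$) and the random field is hypothetical.

```python
import itertools
import math
import random

def hb(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

EPS = (0.5 * hb(0.2) - 0.2) / 2.0  # assumed epsilon_l for Hamming loss

def expected_loss(pmf, order):
    """Normalized expected Hamming loss of the optimal predictor
    when the sites are visited in the fixed order `order`."""
    n = len(order)
    prefix = {}
    for x, q in pmf.items():
        scanned = tuple(x[i] for i in order)
        for t in range(n + 1):
            prefix[scanned[:t]] = prefix.get(scanned[:t], 0.0) + q
    loss = 0.0
    for ctx, q in prefix.items():
        if len(ctx) < n and q > 0.0:
            p1 = min(1.0, max(0.0, prefix.get(ctx + (1,), 0.0) / q))
            loss += q * min(p1, 1.0 - p1)
    return loss / n

rng = random.Random(7)
sites = 4  # e.g. a 2x2 field flattened to indices 0..3
outcomes = list(itertools.product((0, 1), repeat=sites))
weights = [rng.random() ** 3 for _ in outcomes]  # skewed, arbitrary field
total = sum(weights)
pmf = {x: w / total for x, w in zip(outcomes, weights)}

losses = {perm: expected_loss(pmf, perm)
          for perm in itertools.permutations(range(sites))}
spread = max(losses.values()) - min(losses.values())
print(f"spread over scans = {spread:.4f}, 2*eps_l = {2 * EPS:.4f}")
```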

At this point, a few remarks are in order. For the bound in Theorem 13 to be tight, the
following conditions should be met. First, equality is required in (64) for both the scan $\Psi$
and the optimal scan (which achieves $U(l,Q_B)$). It is not hard to see that for a given scan $\Psi$,
equality in (64) is achieved if and only if $P(\cdot|x^{\Psi_{t-1}}) = p$ for all $x^{\Psi_{t-1}}$, where $p$ is a maximizer
of (60). However, for (61) to be tight, it is also required that

$$\alpha_l\frac{1}{|B|}H(X_B) + \beta_l - E_{Q_B}\frac{1}{|B|}L_{(\Psi,F^{opt})}(X_B) = -\alpha_l\frac{1}{|B|}H(X_B) - \beta_l + U(l,Q_B), \qquad (70)$$

so that the triangle inequality holds with equality. Namely, it is required that under the scan $\Psi$,
for example, $P(\cdot|x^{\Psi_{t-1}}) = p$ for all $x^{\Psi_{t-1}}$, where $p$ is such that $\alpha_l h_b(p) + \beta_l - \phi_l(p) = \epsilon_l$,
yet under the optimal scan, say $\Psi'$, $P(\cdot|x^{\Psi'_{t-1}}) = p'$ for all $x^{\Psi'_{t-1}}$, where $p'$ is such that
$\alpha_l h_b(p') + \beta_l - \phi_l(p') = -\epsilon_l$. Clearly this is not always the case, and thus, generally, the bound
in Theorem 13 is not tight. Indeed, although under a different setting (individual images),
in subsection 4.1 we derive a tighter upper bound on the excess loss for the specific case of
Hamming loss. Using this bound, it is easy to see that the 0.16 bound given here (as $\epsilon_l = 0.08$
for Hamming loss) is only a worst case, and typically much tighter bounds on the excess loss
apply, depending on the image compressibility. For example, consider a first-order symmetric
Markov chain with transition probability 1/4. Scanning this source in the trivial (sequential)
order results in an error rate of 1/4. By [22], this is indeed the optimal scanning order for this
source, as it can be represented as an autoregressive process whose innovation process has a
maximum entropy distribution with respect to the Hamming distance. The "odds-then-evens"
scan, however, which was proved useful for this source but with larger transition probabilities
(larger than 1/2, [22]), results in an error rate of 5/16, which is 1/16 away from the optimum.
It is not hard to show that different transition probabilities result in lower excess loss.
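The two error rates quoted for the symmetric Markov chain can be reproduced with exact rational arithmetic. The sketch below assumes Hamming loss, a stationary first-order symmetric binary chain with flip probability $q$, and ignores the vanishing contribution of the boundary sites. Under the "odds-then-evens" scan, an odd site is predicted from the previous odd site (a two-step transition), and an even site is predicted from its two already-scanned neighbors. The function names are illustrative.

```python
from fractions import Fraction

def sequential_error(q):
    """Per-site error of the optimal predictor under the sequential scan."""
    return min(q, 1 - q)

def odds_then_evens_error(q):
    """Asymptotic per-site error under the odds-then-evens scan."""
    r = 2 * q * (1 - q)                  # two-step flip probability
    odd_err = min(r, 1 - r)              # odd site given the previous odd site
    same = 1 - r                         # probability that the two neighbors agree
    p_mid_differs = q * q / same         # middle site differs from two equal neighbors
    even_err = (same * min(p_mid_differs, 1 - p_mid_differs)
                + r * Fraction(1, 2))    # neighbors differ: middle site is a fair coin
    return (odd_err + even_err) / 2

q = Fraction(1, 4)
seq = sequential_error(q)
ote = odds_then_evens_error(q)
print(seq, ote, ote - seq)  # 1/4 5/16 1/16

# Excess loss for other flip probabilities below 1/2.
excess = {Fraction(k, 100): odds_then_evens_error(Fraction(k, 100))
          - sequential_error(Fraction(k, 100)) for k in range(1, 50)}
```

For $q < 1/2$ the excess works out to $q/2 - q^2$, which peaks at $q = 1/4$, in line with the remark that other transition probabilities give a lower excess loss.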

4.1 Individual Images and the Peano-Hilbert Scan 

In this subsection, we seek analogous results for the individual image scenario. Namely, the
data array $x_B$ has no stochastic model. A scandictor $(\Psi,F)$, in this case, wishes to minimize
the cumulative loss over $x_{V_n}$, that is, $L_{(\Psi,F)}(x_{V_n})$ as defined in (3).

In this setting, although one can easily define an empirical probability measure, the invariance
of the entropy $H(X^n)$ to the reordering of the components, which stood at the heart
of Theorem 13, does not hold for an arbitrary reordering (scan) and finite $n$. Thus, we limit the
possible set of scanners to that of the finite state machines discussed earlier. Moreover, in
the sequel, we do not bound the difference in the scandiction losses of any two scandictors
from that set, only that between the Peano-Hilbert scan (which is asymptotically optimal for
compression of individual images [14]) and any other finite state scanner (both accompanied
by an optimal Markov predictor), or between two scans (finite state or not) for which the
FS compressibility of the resulting sequence is the same.

We start with several definitions. Let $\Psi_B$ be a scanner for the data array $x_B$. Let $x_1^{|B|}$
be the sequence resulting from scanning $x_B$ with $\Psi_B$. Fix $k < |B|$ and for any $s \in \{0,1\}^{k+1}$
define the empirical distribution of order $k+1$ as

$$P^{k+1}_{\Psi_B}(s) = \frac{1}{|B|-k}\left| \left\{ k < i \le |B| : x_{i-k}^{i} = s \right\} \right|. \qquad (71)$$

The distributions of lower orders, and the conditional distribution, are derived from $P^{k+1}_{\Psi_B}(s)$,
i.e., for $s' \in \{0,1\}^k$ and $x \in \{0,1\}$ we define

$$P^{k+1}_{\Psi_B}(s') = P^{k+1}_{\Psi_B}([s',0]) + P^{k+1}_{\Psi_B}([s',1]) \qquad (72)$$

and

$$P^{k+1}_{\Psi_B}(x|s') = \frac{P^{k+1}_{\Psi_B}([s',x])}{P^{k+1}_{\Psi_B}(s')}, \qquad (73)$$

where 0/0 is defined as 1/2 and $[\cdot,\cdot]$ denotes string concatenation. Let $H^{k+1}_{\Psi_B}(X|X^k)$ be the
empirical conditional entropy of order $k$, i.e.,

$$H^{k+1}_{\Psi_B}(X|X^k) = -\sum_{s\in\{0,1\}^k} P^{k+1}_{\Psi_B}(s)\sum_{x\in\{0,1\}} P^{k+1}_{\Psi_B}(x|s)\log P^{k+1}_{\Psi_B}(x|s). \qquad (74)$$

^An "odds-then-evens" scanner for a one-dimensional vector $x^n$ first scans all the sites with an odd index, in
ascending order, then all the sites with an even index.
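The definitions (71)-(74) translate directly into a few lines of code. The sketch below computes the order-$k$ empirical conditional entropy of a scanned binary sequence and applies it to a small striped image under two scans; the image, the two scans and the function name are hypothetical choices made for illustration, and the convention 0/0 = 1/2 never has to be invoked because only observed windows contribute to the sum.

```python
import math
from collections import Counter

def empirical_conditional_entropy(seq, k):
    """H^{k+1}(X | X^k) in bits, derived from the order-(k+1) empirical distribution."""
    n = len(seq)
    if not 0 <= k < n:
        raise ValueError("need 0 <= k < len(seq)")
    # (71): counts of all length-(k+1) windows x_{i-k}^{i}
    windows = Counter(tuple(seq[i - k:i + 1]) for i in range(k, n))
    total = n - k
    # (72): marginal of the first k symbols, obtained by summing out the last one
    contexts = Counter()
    for s, c in windows.items():
        contexts[s[:-1]] += c
    # (73)-(74): conditional probabilities and the conditional entropy
    h = 0.0
    for s, c in windows.items():
        h -= (c / total) * math.log2(c / contexts[s[:-1]])
    return h

# A 4x4 image with horizontal stripes (rows alternate between 0 and 1).
image = [[r % 2] * 4 for r in range(4)]
raster = [v for row in image for v in row]                       # row by row
column = [image[r][c] for c in range(4) for r in range(4)]       # column by column

h_raster = empirical_conditional_entropy(raster, 1)
h_column = empirical_conditional_entropy(column, 1)
print(h_raster, h_column)
```

The two scans of the same image give different first-order empirical conditional entropies, which is the lack of invariance to the scanning order that the individual-image results have to work around.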

Finally, denote by $F^{k,opt}$ the optimal $k$-th order Markov predictor, in the sense that it minimizes
the expected loss with respect to $P^{k+1}_{\Psi_B}(\cdot|\cdot)$ and $l(\cdot,\cdot)$. The following proposition is the individual
image analogue of Proposition 15.

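Proposition 16. Let $x_B$ be any data array. Let $\frac{1}{|B|}L_{(\Psi_B,F^{k,opt})}(x_B)$ denote the normalized
cumulative loss of the scandictor $(\Psi_B, F^{k,opt})$, where $\Psi_B$ is any (data-dependent) scan and
$F^{k,opt}$ is the optimal $k$-th order Markov predictor with respect to $\Psi_B$ and $l$. Then,

$$\left| \alpha_l H^{k+1}_{\Psi_B}(X|X^k) + \beta_l - \frac{1}{|B|}L_{(\Psi_B,F^{k,opt})}(x_B) \right| \le \epsilon_l + O\!\left(\frac{k}{|B|}\right), \qquad (75)$$

where $\alpha_l$ and $\beta_l$ are the achievers of the minimum in (60).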



Since $x_B$ is an individual image, $x_1^{|B|} = \Psi_B(x_B)$ is fixed. In that sense, the proof resembles
that of Proposition 14, and we write $x_t$ for the value of $x$ at the $t$-th site $\Psi_B$ visits. On the
other hand, since the order of the predictor, $k$, is fixed, we can use $P^{k+1}_{\Psi_B}(\cdot)$ and avoid the
summation over the time index $t$. The complete details can be found in Appendix A.5.

The bound in Proposition 16 differs from the one in Proposition 15 for two reasons. First,
it is only asymptotic due to the $O(k/|B|)$ term. Second, the empirical entropy $H^{k+1}_{\Psi_B}(X|X^k)$
is not invariant to the scanning order. This is a profound difference between the random and
the individual settings, and, in fact, is at the heart of [14]. In the random setting, the chain
rule for entropies implies invariance of the entropy rate to the scanning order. This fact does
not hold for a $k$-th order empirical distribution of an individual image, hence the usage of the
Peano-Hilbert scanning order. Consequently, we cannot directly compare between any two
scans. Nevertheless, Proposition 16 has the following two interesting applications, given by
Proposition 17 and Corollary 18.

^Note that defining $P^{k+1}_{\Psi_B}(x|s')$, $s' \in \{0,1\}^k$, as $P^{k+1}_{\Psi_B}([s',x])/P^{k}_{\Psi_B}(s')$ is not consistent, since generally
$P^{k}_{\Psi_B}(s') \ne P^{k+1}_{\Psi_B}([s',0]) + P^{k+1}_{\Psi_B}([s',1])$.

For $\Psi = \{\Psi_n\}$, where $\Psi_n$ is a scan for $V_n$, and an infinite individual image $x$, define

$$L^k_\Psi(x) = \limsup_{n\to\infty}\frac{1}{|V_n|}L_{(\Psi_n,F^{k,opt})}(x_{V_n}) \qquad (76)$$

and

$$L_\Psi(x) = \lim_{k\to\infty} L^k_\Psi(x). \qquad (77)$$

Proposition 17 relates the asymptotic cumulative loss of any sequence of finite state scans
$\Psi$ to that resulting from the Peano-Hilbert sequence of scans, establishing the Peano-Hilbert
sequence as an advantageous scanning order for any loss function.

Proposition 17. Let $x$ be any individual image. Let PH denote the Peano-Hilbert sequence of
scans. Then, for any sequence of finite state scans $\Psi$ and any loss function $l : \{0,1\} \times [0,1] \to \mathbb{R}$,

$$L_{PH}(x) \le L_\Psi(x) + 2\epsilon_l. \qquad (78)$$
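For readers who wish to experiment with the Peano-Hilbert scan, the sketch below generates a Hilbert-curve visiting order for a $2^m \times 2^m$ array using the standard index-to-coordinate construction. This is a common textbook implementation and is only assumed to match the scan used in [14] up to orientation; the function names are illustrative.

```python
def hilbert_d2xy(n, d):
    """Map index d in [0, n*n) to (x, y) on an n x n grid, n a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate / reflect the current quadrant
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_scan(n):
    """The sequence of sites visited by the Hilbert scan of an n x n array."""
    return [hilbert_d2xy(n, d) for d in range(n * n)]

def scan_image(image, order):
    """Read a square image (list of rows) along a given site order."""
    return [image[y][x] for x, y in order]

scan = hilbert_scan(8)
steps = [abs(x1 - x0) + abs(y1 - y0)
         for (x0, y0), (x1, y1) in zip(scan, scan[1:])]
```

Each step of the scan moves to a neighboring site, and every site is visited exactly once; the resulting one-dimensional sequence can then be fed to any sequential predictor.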

Before we prove Proposition 17, define the asymptotic $k$-th order empirical conditional
entropy under $\{\Psi_n\}$ as

$$H^{k+1}_\Psi(x) = \limsup_{n\to\infty} H^{k+1}_{\Psi_n}(X|X^k) \qquad (79)$$

and further define

$$H_\Psi(x) = \lim_{k\to\infty} H^{k+1}_\Psi(x). \qquad (80)$$

The existence of $H_\Psi(x)$ is established later in the proof of Proposition 17, where it is also
shown that this limit equals $\lim_{k\to\infty}\limsup_{n\to\infty}\frac{1}{k}H^{k}_{\Psi_n}(X^k)$. By [15, Theorem 3], the latter
limit is no other than the asymptotic finite state compressibility of $x$ under the sequence of
scans $\Psi$, namely,

$$\lim_{k\to\infty}\limsup_{n\to\infty}\frac{1}{k}H^{k}_{\Psi_n}(X^k) = \rho(\Psi(x)) = \lim_{s\to\infty}\limsup_{n\to\infty}\rho_{E(s)}\big(\Psi_n(x_{V_n})\big), \qquad (81)$$

where $\rho_{E(s)}(x_1^n)$ is the minimum compression ratio for $x_1^n$ over the class of all finite state
encoders with at most $s$ states [15, eq. (1)-(4)]. We may now introduce the following corollary.

Corollary 18. Let $\Psi_1$ and $\Psi_2$ be any two sequences of scans such that $H_{\Psi_1}(x) = H_{\Psi_2}(x)$ (in
particular, if both $\Psi_1$ and $\Psi_2$ are finite state sequences of scans, they result in the same finite
state compressibility). Then,

$$\left| L_{\Psi_1}(x) - L_{\Psi_2}(x) \right| \le 2\epsilon_l \qquad (82)$$

for any loss function $l : \{0,1\} \times [0,1] \to \mathbb{R}$.

^Yet, the Peano-Hilbert scan is by no means the only optimal scan. We elaborate on this issue later in this section.

For a given sequence of scans $\Psi$, the set of scanning sequences $\Psi'$ satisfying $H_\Psi(x) = H_{\Psi'}(x)$
is larger than one might initially think. For example, a close look at the definition of finite
state compressibility given in [15] shows that the finite state encoders defined therein allow
limited scanning schemes, as an encoder might read a large data set before its output for that
data set is given. Thus, a legitimate finite state encoder in the sense of [15] may reorder
the data in a block (of bounded length, as the number of states is bounded) before actually
encoding it. Consequently, for any individual sequence $x$ one can define several permutations
having the same finite state compressibility. In the multidimensional scenario this amounts to
saying that for each scanning sequence $\Psi$ there exist several different scanning sequences $\Psi'$
for which $H_\Psi(x) = H_{\Psi'}(x)$.



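Proof of Proposition 17. For each $n$, $\Psi_n$ is a scanner for $V_n$. Thus, by Proposition 16, we have

$$\left| \alpha_l H^{k+1}_{\Psi_n}(X|X^k) + \beta_l - \frac{1}{|V_n|}L_{(\Psi_n,F^{k,opt})}(x_{V_n}) \right| \le \epsilon_l + O\!\left(\frac{k}{|V_n|}\right). \qquad (83)$$

Taking the limsup as $n \to \infty$ yields

$$\left| \alpha_l\limsup_{n\to\infty} H^{k+1}_{\Psi_n}(X|X^k) + \beta_l - L^k_\Psi(x) \right| \le \epsilon_l. \qquad (84)$$

For a stationary source, it is well known (e.g., [40, Theorem 4.2.1]) that $\lim_{k\to\infty} H(X_k|X_1^{k-1})$
exists and in fact

$$\lim_{k\to\infty} H(X_k|X_1^{k-1}) = \lim_{k\to\infty}\frac{1}{k}H(X_1^k). \qquad (85)$$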

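We now show that the same holds for empirical entropies. We start by showing that
$\limsup_{n\to\infty} H^{k+1}_{\Psi_n}(X|X^k)$ is a decreasing sequence in $k$. Since conditioning reduces the entropy,
it is clear that $H^{k+1}_{\Psi_n}(X|X^k) \le H^{k+1}_{\Psi_n}(X|X^{k-1})$, where both are calculated using $P^{k+1}_{\Psi_n}(\cdot)$.
However, the above may not be true when $H^{k+1}_{\Psi_n}(X|X^{k-1})$ is replaced by $H^{k}_{\Psi_n}(X|X^{k-1})$, as
the latter is calculated using $P^{k}_{\Psi_n}(\cdot)$. Nevertheless, using a simple counting argument, it is not
too hard to show that for every $k$, $0 < j < k$ and $s \in \{0,1\}^i$, where $0 < i < j$, the empirical
probabilities $P^{k+1}_{\Psi_n}(s)$ and $P^{j+1}_{\Psi_n}(s)$ differ by a term that vanishes as $n \to \infty$ (86).
Thus, by the continuity of the entropy function, we have

$$
\begin{aligned}
\limsup_{n\to\infty} H^{k+1}_{\Psi_n}(X|X^k) &\le \limsup_{n\to\infty} H^{k+1}_{\Psi_n}(X|X^{k-1}) \\
&= \limsup_{n\to\infty} H^{k}_{\Psi_n}(X|X^{k-1}), \qquad (87)
\end{aligned}
$$

hence $\limsup_{n\to\infty} H^{k+1}_{\Psi_n}(X|X^k)$ is decreasing in $k$. Since it is a non-negative sequence, $H_\Psi(x)$
as defined in (80) exists and we have

$$\left| \alpha_l H_\Psi(x) + \beta_l - L_\Psi(x) \right| \le \epsilon_l. \qquad (88)$$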

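We now show that indeed $H_\Psi(x)$ equals $\rho(\Psi(x))$ for every sequence of finite state scans $\Psi$,
hence when $\Psi$ is a sequence of finite state scans the results of [14] can be applied. The method
is similar to that in [40, Theorem 4.2.1], with an adequate handling of empirical entropies.
By the chain rule and the counting argument above,

$$
\begin{aligned}
\limsup_{n\to\infty}\frac{1}{k}H^{k}_{\Psi_n}(X^k) &= \limsup_{n\to\infty}\frac{1}{k}\sum_{i=1}^{k} H^{k}_{\Psi_n}(X_i|X_1^{i-1}) \\
&= \limsup_{n\to\infty}\frac{1}{k}\sum_{i=1}^{k} H^{i}_{\Psi_n}(X_i|X_1^{i-1}). \qquad (89)
\end{aligned}
$$

But the sequence $\limsup_{n\to\infty} H^{i}_{\Psi_n}(X_i|X_1^{i-1})$ converges to $H_\Psi(x)$ as $i \to \infty$, thus its Cesàro
mean converges to the same limit and we have

$$H_\Psi(x) = \lim_{k\to\infty}\limsup_{n\to\infty}\frac{1}{k}H^{k}_{\Psi_n}(X^k) = \rho(\Psi(x)). \qquad (90)$$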

Consider now the Peano-Hilbert sequence of finite state scans, denoted by PH. Let $\rho(x)$
denote the (finite state) compressibility of $x$ as defined in [14, eq. (4)]. For any other sequence
of finite state scans $\Psi$ we have

$$H_{PH}(x) \le \rho(x) \le H_\Psi(x), \qquad (91)$$

where the first inequality is by [14, eq. (9) and (16)] and the second is straightforward from
the definition of $\rho(x)$. Finally,

$$
\begin{aligned}
L_{PH}(x) &\overset{(a)}{\le} \epsilon_l + \beta_l + \alpha_l H_{PH}(x) \\
&\le \epsilon_l + \beta_l + \alpha_l H_\Psi(x) \\
&\overset{(b)}{\le} 2\epsilon_l + L_\Psi(x), \qquad (92)
\end{aligned}
$$

where (a) and (b) result from the application of (88) to the sequences PH and $\Psi$, respectively.

□

The proof of Corollary 18 is straightforward, using (88) for both $\Psi_1$ and $\Psi_2$ and the triangle
inequality.

4.1.1 Hamming Loss 

The bound in Proposition 17 is valid for any loss function $l : \{0,1\} \times [0,1] \to \mathbb{R}$. When $l$ is
the Hamming loss, the resulting bound is

$$L^{Hamming}_{PH}(x) \le L^{Hamming}_{\Psi}(x) + 0.16, \qquad (93)$$

for any other finite state sequence of scans $\Psi$, namely, a uniform bound, regardless of the
compressibility of $x$. However, using known bounds on the predictability of a sequence (under
Hamming loss) in terms of its compressibility can yield a tighter bound.

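In [41], Feder, Merhav and Gutman proved that for any next-state function $g \in G_s$, where
$G_s$ is the set of all possible next-state functions with $s$ states, and for any sequence $x_1^n$,

$$\frac{1}{2}\rho(g,x_1^n) \ge \mu(g,x_1^n) \ge h^{-1}\big(\rho(g,x_1^n)\big), \qquad (94)$$

where $\mu(g,\cdot)$ ($\rho(g,\cdot)$) is the best possible prediction (compression) performance when the
next-state function is $g$. Consequently, for any two finite-state scans $\Psi^1$ and $\Psi^2$ for $x_{V_n}$,

$$
\begin{aligned}
\min_{g\in G_s}\mu\big(g,\Psi^1(x_{V_n})\big) - \min_{g\in G_s}\mu\big(g,\Psi^2(x_{V_n})\big)
&\le \min_{g\in G_s}\frac{1}{2}\rho\big(g,\Psi^1(x_{V_n})\big) - \min_{g\in G_s}h^{-1}\Big(\rho\big(g,\Psi^2(x_{V_n})\big)\Big) \\
&= \frac{1}{2}\min_{g\in G_s}\rho\big(g,\Psi^1(x_{V_n})\big) - h^{-1}\Big(\min_{g\in G_s}\rho\big(g,\Psi^2(x_{V_n})\big)\Big). \qquad (95)
\end{aligned}
$$

Taking $\Psi^1$ to be the Peano-Hilbert scan, the results of [14] imply that

$$\min_{g\in G_s}\rho\big(g,\Psi_{PH}(x_{V_n})\big) \le \min_{g\in G_s}\rho\big(g,\Psi(x_{V_n})\big) + \epsilon_{n,s} \qquad (96)$$

for any finite-state scan $\Psi$, where $\epsilon_{n,s}$ satisfies $\lim_{s\to\infty}\limsup_{n\to\infty}\epsilon_{n,s} = 0$. Hence,

$$
\min_{g\in G_s}\mu\big(g,\Psi_{PH}(x_{V_n})\big) - \min_{g\in G_s}\mu\big(g,\Psi(x_{V_n})\big)
\le \frac{1}{2}\min_{g\in G_s}\rho\big(g,\Psi_{PH}(x_{V_n})\big) - h^{-1}\Big(\min_{g\in G_s}\rho\big(g,\Psi_{PH}(x_{V_n})\big) - \epsilon_{n,s}\Big). \qquad (97)
$$

Taking the limits $\limsup_{n\to\infty}$ and then $s \to \infty$ implies the following proposition.

Proposition 19. Let $x$ be any individual image. Let PH denote the Peano-Hilbert sequence
of scans. Then, under the Hamming loss function, for any sequence of finite state scans $\Psi$ we
have

$$L_{PH}(x) \le L_\Psi(x) + \frac{1}{2}\rho(x) - h^{-1}\big(\rho(x)\big), \qquad (98)$$

where $\rho(x)$ is the compressibility of the individual image $x$.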

In other words, the specific scandictor composed of the Peano-Hilbert scan followed by the
optimal predictor adheres to the same asymptotic bounds (on predictability in terms of the
compressibility) as the best finite-state scandictor. Figure 2 plots the function $\frac{1}{2}\rho - h^{-1}(\rho)$.
The maximum possible loss is 0.16, similar to the bound given in Proposition 17, yet this value
is achieved only when the image's FS compressibility is around 0.75 bits/symbol. For images
which are highly compressible, for example, when $\rho < 0.1$, the resulting excess loss is smaller
than 0.04.

Figure 2: A plot of $\frac{1}{2}\rho - h^{-1}(\rho)$, the upper bound on the redundancy in using the Peano-Hilbert scan,
as a function of $\rho$, the compressibility of the image. The maximum redundancy is not higher than 0.16 in
the worst case, but will be much lower for more compressible arrays.

5 Conclusion 

In this paper, we formally defined finite set scandictability, and showed that there exists a
universal algorithm which successfully competes with any finite set of scandictors when the
random field is stationary. Moreover, the existence of a universal algorithm which achieves the
scandictability of any spatially stationary random field was established. We then considered
the scenario where non-optimal scanners are used, and derived a bound on the excess loss in
that case, compared to optimal scandiction.

It is clear that the scandiction problem is even more intricate than its prediction analogue.
For instance, very basic results in the prediction scenario do not apply to the scandiction case
in a straightforward way, and, in fact, are still open problems. To name a few, consider the
case of universal scandiction of individual images, briefly discussed in Section 3.6. Although
the question whether there exists a universal scandictor which competes successfully with any
finite set of scandictors on any individual image was answered negatively in Section 3.1, it is
of interest to identify sets of scandictors for which universal scandiction is possible.
The sequential prediction literature also includes an elegant result [41] on the asymptotic
equivalence between finite state and Markov predictors. We conjecture that this equivalence
does not hold in the multi-dimensional scenario for any individual image. Finally, the very
basic problem of determining the optimal scandictor for a given random field $X$ with a known
probability measure $Q$ is still unsolved in the general case.

It is also interesting to consider the problems of scanning and prediction, as well as filtering,
in a noisy environment. These problems are intimately related to various problems in
communications and image processing, such as filtering and denoising of images and video. As
mentioned in Section 1, these problems are the subject of [27].



A Appendixes 

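A.1 Proof of Proposition 5

For the sake of simplicity, we suppress the dependence of $m(n)$ on $n$. Define $W_i = \sum_{j=1}^{\lambda} e^{-\eta L_{j,i}}$.
We have

$$
\begin{aligned}
\log\frac{W_{(K+1)^2}}{W_0} &= \log\sum_{j=1}^{\lambda} e^{-\eta L_{j,(K+1)^2}} - \log\lambda \\
&\ge \log\max_j e^{-\eta L_{j,(K+1)^2}} - \log\lambda \\
&= -\eta\min_j L_{j,(K+1)^2} - \log\lambda \\
&= -\eta L_{min} - \log\lambda. \qquad (99)
\end{aligned}
$$

Moreover,

$$
\begin{aligned}
\log\frac{W_{i+1}}{W_i} &= \log\frac{\sum_{j=1}^{\lambda} e^{-\eta L_{j,i}}e^{-\eta L_j(x^i)}}{\sum_{j=1}^{\lambda} e^{-\eta L_{j,i}}} \\
&= \log\sum_{j=1}^{\lambda} P_i\big(j|\{L_{j,i}\}_{j=1}^{\lambda}\big)e^{-\eta L_j(x^i)} \\
&\le -\eta\sum_{j=1}^{\lambda} P_i\big(j|\{L_{j,i}\}_{j=1}^{\lambda}\big)L_j(x^i) + \frac{m^4 l_{max}^2\eta^2}{8}, \qquad (100)
\end{aligned}
$$

where the last inequality follows from the extension to Hoeffding's inequality given in [33] and
the fact that $-\eta L_j(x^i)$ is in the range $[-\eta m^2 l_{max}, 0]$. Thus,

$$
\begin{aligned}
\log\frac{W_{(K+1)^2}}{W_0} &= \sum_{i=0}^{(K+1)^2-1}\log\frac{W_{i+1}}{W_i} \\
&\le -\eta\sum_{i=0}^{(K+1)^2-1}\sum_{j=1}^{\lambda} P_i\big(j|\{L_{j,i}\}_{j=1}^{\lambda}\big)L_j(x^i) + \frac{m^4 l_{max}^2\eta^2(K+1)^2}{8} \\
&= -\eta\bar{L}_{alg} + \frac{m^4 l_{max}^2\eta^2(K+1)^2}{8}. \qquad (101)
\end{aligned}
$$

Finally, from (99) and (101), we have

$$\bar{L}_{alg} - L_{min} \le \frac{\log\lambda}{\eta} + \frac{m^4 l_{max}^2\eta(K+1)^2}{8}. \qquad (102)$$

The bound in (14) easily follows after optimizing the right hand side of (102) with respect to
$\eta$.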
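The argument above is the standard analysis of an exponential-weighting (randomized) selection rule, applied block by block. The sketch below simulates such a rule on synthetic per-block losses and checks the expected-regret bound obtained by optimizing a bound of the form (102). The loss table, the number of blocks and experts, the per-block loss range `max_block_loss` (playing the role of $m^2 l_{max}$), and the use of natural logarithms are all assumptions of the sketch, and the function name is illustrative.

```python
import math
import random

def exponential_weighting(block_losses, eta, rng):
    """block_losses[i][j] is the loss of scandictor j on block i.

    Returns (expected loss of the randomized rule, realized loss of one run,
    cumulative loss of the best single scandictor)."""
    n_experts = len(block_losses[0])
    cumulative = [0.0] * n_experts                 # L_{j,i}
    expected_loss = 0.0
    realized_loss = 0.0
    for losses in block_losses:
        weights = [math.exp(-eta * c) for c in cumulative]
        w_sum = sum(weights)                       # W_i
        probs = [w / w_sum for w in weights]       # P_i(j | {L_{j,i}})
        expected_loss += sum(p * l for p, l in zip(probs, losses))
        chosen = rng.choices(range(n_experts), weights=probs)[0]
        realized_loss += losses[chosen]
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return expected_loss, realized_loss, min(cumulative)

rng = random.Random(1)
n_blocks, n_experts, max_block_loss = 200, 5, 4.0
losses = [[rng.uniform(0.0, max_block_loss) for _ in range(n_experts)]
          for _ in range(n_blocks)]

# Learning rate minimizing log(lambda)/eta + R^2 * eta * N / 8.
eta = math.sqrt(8.0 * math.log(n_experts) / (max_block_loss ** 2 * n_blocks))
expected_loss, realized_loss, best_loss = exponential_weighting(losses, eta, rng)
regret_bound = max_block_loss * math.sqrt(n_blocks * math.log(n_experts) / 2.0)
print(expected_loss - best_loss, regret_bound)
```

The expected regret is guaranteed to stay below the bound for any loss table with entries in the assumed range; the realized loss of a single run fluctuates around the expected loss.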

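A.2 Proof of Proposition 6

Let $\delta(n)$ be some sequence satisfying $\delta(n) \to 0$ as $n \to \infty$. Define the sets

$$A_n = \left\{ \omega : \frac{L_{alg}(x_{V_n}) - L_{min}(x_{V_n})}{n^2} > \delta(n) \right\}, \qquad (103)$$

where $(\Omega, P)$ is the probability space. We wish to show that

$$P\left( \limsup_{n\to\infty} A_n \right) = 0, \qquad (104)$$

that is, $P(A_n \text{ i.o.}) = 0$. Let $(\Psi,F)_k$ be the scandictor chosen by the algorithm for the $(k+1)$-th
block, $x^k$. Define

$$Z_k = L_{(\Psi,F)_k}(x^k) - E\left\{ L_{(\Psi,F)_k}(x^k) \,\big|\, \{L_{j,k}\}_{j=1}^{\lambda} \right\}, \qquad (105)$$

where the expectation is with respect to $P_k\big(j|\{L_{j,k}\}_{j=1}^{\lambda}\big)$. Namely, the actual randomization
in $Z_k$ is in the choice of $(\Psi,F)_k$. Thus, $\{Z_k\}$ are clearly independent, and adhere to the
following Chernoff-like bound [33, eq. 33]:

$$P\left( \sum_{k=1}^{(K+1)^2} Z_k > (K+1)^2\epsilon \right) \le \exp\left\{ -\frac{2(K+1)^2\epsilon^2}{m^4 l_{max}^2} \right\} \qquad (106)$$

for any $\epsilon > 0$. Note that

$$\sum_{k=1}^{(K+1)^2} Z_k = L_{alg}(x_{V_n}) - \bar{L}_{alg}(x_{V_n}), \qquad (107)$$

thus, together with eq. (14), we have

$$P\left( L_{alg}(x_{V_n}) - L_{min}(x_{V_n}) > (K+1)^2\epsilon + m(n+m)\sqrt{\frac{\log\lambda}{2}}\,l_{max} \right) \le \exp\left\{ -\frac{2(K+1)^2\epsilon^2}{m^4 l_{max}^2} \right\}. \qquad (108)$$

Set

$$\delta(n) = \frac{(K+1)^2\epsilon + m(n+m)\sqrt{\frac{\log\lambda}{2}}\,l_{max}}{n^2}. \qquad (109)$$

Clearly $\delta(n) \to 0$ as $n \to \infty$ for any $m(n) = o(n)$ satisfying $m(n) \to \infty$. For the summability
of the r.h.s. of (108) we further require that $m(n) = o(n^{1/3})$. The proposition then follows
directly by applying the Borel-Cantelli lemma.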

A.3 Proof of Proposition 11

We show, by induction on $w$, that the number of sites in $B$ for which the context of size $w$ (in
terms of sites in $B$) under the scan $\Psi$ is not contained in the context of size $Kw$ under the
scan $\Psi'$ is at most $o(|B|)(K+1)^{w-1}$. This proves the proposition, as the cumulative loss of
$(\Psi', F^{Kw,opt})$ is no larger than $o(|B|)(K+1)^{w-1}l_{max}$ on these sites, and is at least as small as
that of $(\Psi, F^{w,opt})$ on all the remaining $|B| - o(|B|)(K+1)^{w-1}$ sites.

For $w = 1$ this is indeed so, by our assumption on $\Psi$ and $\Psi'$, i.e., (55). We say that a site
in $B$ satisfies the context-condition with length $w-1$ if its context of size $i-1$, $1 < i \le w$,
under the scan $\Psi$ is contained in its context of size $K(i-1)$ under the scan $\Psi'$. Assume
that the number of sites in $B$ which do not satisfy the context-condition with length $w-1$
is at most $o(|B|)(K+1)^{w-2}$. We wish to lower bound the number of sites in $B$ for which
the context-condition with length $w$ is satisfied. A sufficient condition is that the
context-condition with length $w-1$ is satisfied for both the site itself and its immediate past under
$\Psi$. If the context-condition with length $w-1$ is satisfied for a site, its immediate past under
$\Psi$ is contained in its past of length $K$ under $\Psi'$. Thus, if the context-condition of length $w-1$
is satisfied for a given site, and for all $K$ preceding sites under $\Psi'$, then it is also satisfied for
length $w$. In other words, each site in $B$ which does not satisfy the context-condition with
length $w-1$ results in at most $K+1$ sites (itself and $K$ more sites) which do not satisfy the
context-condition with length $w$. Hence, if our inductive assumption is satisfied for $w-1$,
the number of sites in $B$ which do not satisfy the context-condition with length $w$ is at most
$o(|B|)(K+1)^{w-2}(K+1)$, which completes the proof.



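A.4 Proof of Proposition 12

The proof is a direct application of Propositions 5 and 11. For each $n$, define the scandictors
set

$$\mathcal{F}^{Kw}_n = \Big\{ (\Psi'_1,F^{Kw,1}), (\Psi'_1,F^{Kw,2}), \ldots, (\Psi'_1,F^{Kw,|D|^{|A|^{Kw}}}), \ldots, (\Psi'_\lambda,F^{Kw,1}), (\Psi'_\lambda,F^{Kw,2}), \ldots, (\Psi'_\lambda,F^{Kw,|D|^{|A|^{Kw}}}) \Big\}, \qquad (110)$$

where $\{F^{Kw,j}\}_{j=1}^{|D|^{|A|^{Kw}}}$ is the set of all Markov predictors of order $Kw$. Applying the results
of Proposition 5 to $\{\mathcal{F}^{Kw}_n\}$, we have, for any image $x$ and all $n$,

$$EL_{alg}(x_{V_n}) - \min_{(\Psi,F)\in\mathcal{F}^{Kw}_{m(n)}} EL_{(\Psi,F)}(x_{V_n}) \le m(n)\big(n+m(n)\big)\sqrt{\frac{\log\big(\lambda|D|^{|A|^{Kw}}\big)}{2}}\,l_{max}, \qquad (111)$$

where $\min_{(\Psi,F)\in\mathcal{F}^{Kw}_{m(n)}} L_{(\Psi,F)}(x_{V_n})$ is the cumulative loss of the best scandictor in $\mathcal{F}^{Kw}_{m(n)}$
operating block-wise on $x_{V_n}$. However, by Proposition 11, for any $1 \le i \le \lambda$, $x$ and $n$,

$$\min_{1\le j\le|D|^{|A|^{Kw}}} EL_{(\Psi'_i,F^{Kw,j})}(x_{V_n}) \le EL_{(\Psi_i,F_i)}(x_{V_n}) + o\big(m(n)^2\big)(K+1)^{w-1}l_{max}\left\lceil\frac{n}{m(n)}\right\rceil^2. \qquad (112)$$

Note that

$$
\begin{aligned}
\min_{(\Psi,F)\in\mathcal{F}^{Kw}_{m(n)}} EL_{(\Psi,F)}(x_{V_n}) &= \min_{1\le i\le\lambda}\ \min_{1\le j\le|D|^{|A|^{Kw}}} EL_{(\Psi'_i,F^{Kw,j})}(x_{V_n}) \\
&\le \min_{1\le i\le\lambda} EL_{(\Psi_i,F_i)}(x_{V_n}) + o\big(m(n)^2\big)(K+1)^{w-1}l_{max}\left\lceil\frac{n}{m(n)}\right\rceil^2 \\
&= \min_{(\Psi,F)\in\mathcal{F}_n} EL_{(\Psi,F)}(x_{V_n}) + o\big(m(n)^2\big)(K+1)^{w-1}l_{max}\left\lceil\frac{n}{m(n)}\right\rceil^2. \qquad (113)
\end{aligned}
$$

Thus, together with (111), we have

$$EL_{alg}(x_{V_n}) - \min_{(\Psi,F)\in\mathcal{F}_n} EL_{(\Psi,F)}(x_{V_n}) \le m(n)\big(n+m(n)\big)\sqrt{\frac{\log\big(\lambda|D|^{|A|^{Kw}}\big)}{2}}\,l_{max} + o\big(m(n)^2\big)(K+1)^{w-1}l_{max}\left\lceil\frac{n}{m(n)}\right\rceil^2, \qquad (114)$$

which completes the proof since $|D|$, $|A|$, $K$ and $w$ are finite.

^Alternatively, one can use one universal predictor which competes successfully with all the Markov predictors of
that order.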



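A.5 Proof of Proposition 16

Similar to the proof of Proposition 14, we have,

$$
\begin{aligned}
\left| \alpha_l H^{k+1}_{\Psi_B}(X|X^k) + \beta_l - \frac{1}{|B|}L_{(\Psi_B,F^{k,opt})}(x_B) \right|
&= \left| \alpha_l H^{k+1}_{\Psi_B}(X|X^k) + \beta_l - \frac{1}{|B|}\left( \sum_{t=1}^{k} l\big(x_t,F_t^{k,opt}(x_1^{t-1})\big) + \sum_{t=k+1}^{|B|} l\big(x_t,F^{k,opt}(x_{t-k}^{t-1})\big) \right) \right| \\
&\le \left| \alpha_l H^{k+1}_{\Psi_B}(X|X^k) + \beta_l - \frac{1}{|B|}\sum_{t=k+1}^{|B|} l\big(x_t,F^{k,opt}(x_{t-k}^{t-1})\big) \right| + \frac{k\,l_{max}}{|B|}. \qquad (115)
\end{aligned}
$$

Since the order of the predictor is fixed, we can use the definition of $P^{k+1}_{\Psi_B}(s)$ and sum over
$s \in \{0,1\}^{k+1}$ instead of $t$. Thus,

$$
\begin{aligned}
\left| \alpha_l H^{k+1}_{\Psi_B}(X|X^k) + \beta_l - \frac{1}{|B|}L_{(\Psi_B,F^{k,opt})}(x_B) \right|
&\le \left| \alpha_l H^{k+1}_{\Psi_B}(X|X^k) + \beta_l - \sum_{s\in\{0,1\}^{k+1}} P^{k+1}_{\Psi_B}(s)\,l\big(s_{k+1},F^{k,opt}(s_1^k)\big) \right| + \frac{2k\,l_{max}}{|B|} \\
&= \left| \sum_{s'\in\{0,1\}^k}\sum_{x\in\{0,1\}} P^{k+1}_{\Psi_B}([s',x])\Big( -\alpha_l\log P^{k+1}_{\Psi_B}(x|s') + \beta_l - l\big(x,F^{k,opt}(s')\big) \Big) \right| + \frac{2k\,l_{max}}{|B|} \\
&\le \sum_{s'\in\{0,1\}^k} P^{k+1}_{\Psi_B}(s')\max_p\left| \alpha_l h_b(p) + \beta_l - \phi_l(p) \right| + \frac{2k\,l_{max}}{|B|} \\
&= \epsilon_l + \frac{2k\,l_{max}}{|B|}, \qquad (116)
\end{aligned}
$$

where the second $k\,l_{max}/|B|$ term accounts for normalizing the sum over $t$ by $|B|$ rather than by $|B|-k$.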